FOREWORD


This  document  is  the  second  of  two  Technical  Documentary  Reports 
prepared  for  the  Air  Force  Electronic  Systems  Division  as  part  of 
a  project  to  develop  better  techniques  for  estimating  the  costs 
of  computer  programming. 

The  first  volume  of  this  report  describes  the  work  done  to  identify 
and  to  organize  the  many  factors  that  affect  the  cost  of  computer 
programs. Much of this work, which served as a basis for the quantitative
analysis, was performed during the summer and fall of 1963
under the sponsorship of the DOD Advanced Research Projects Agency.

The  quantitative  analysis  described  in  this  second  volume  began 
on  1  March  1964  under  the  sponsorship  of  ESD. 

The  two  volumes  of  this  TDR  bear  the  following  System  Development 
Corporation  document  numbers : 

Volume I - TM-1447/000/02
Volume II - TM-1447/001/00


ABSTRACT 


Results  of  an  exploratory  analysis  aimed  at  deriving  better  cost-estimating 
relationships for computer programming development are presented. Based
upon  previous  work  that  hypothesized  an  initial  list  of  factors  affecting 
cost,  the  report  describes  the  steps  taken  to  collect  and  analyze  data  for 
the  purpose  of  supporting  or  rejecting  the  presumed  factors.  As  a  result, 
equations  that  estimate  costs  in  terms  of  such  resources  as  man  months  and 
computer  hours  have  been  derived.  Since  these  estimating  devices  were 
evolved  from  a  small  and,  perhaps,  unrepresentative  sample  of  programs, 
the  use  of  these  equations  is  not  recommended  for  actual  planning.  The 
study  concludes  that  multivariate  regression  analysis,  supplemented  by 
pertinent  judgment  and  intuition,  is  an  appropriate  tool  for  deriving 
cost-estimating  relationships.  To  arrive  at  more  useful  prediction 
equations,  recommendations  are  made  for  continuing  the  research.  These 
include  increasing  the  sample  size  and  improving  the  questionnaire  used 
to  collect  data.  The  basic  inputs  for  the  analyses,  the  actual  cost 
data,  representing  twenty-seven  program  development  efforts,  are 
included. 


INTRODUCTION


This  is  the  second  of  two  volumes  prepared  for  the  Air  Force  Electronic 
Systems  Division  (ESD)  as  part  of  a  project  to  develop  better  techniques 
for  estimating  the  costs  of  computer  programming.  The  general  project 
objective  is  to  conduct  research  and  analysis  aimed  at  developing  tools 
and  guidelines  for  both  managers  and  buyers  of  computer  programming 
products.  These  aids  are  intended  to  help  managers  improve  the  control 
and  planning  of  computer  program  development  by  providing  means  for 
lowering  costs,  shortening  lead  times,  and  improving  product  quality. 
Additionally,  the  long-range  results  of  this  work  are  intended  to  help 
buyers  compare  and  evaluate  computer  programming  products  on  a  systematic 
basis. 

Since  little  is  known  today  about  cost-estimating  relationships  for 
computer  program  development,  both  the  research  and  the  techniques  used 
to conduct it have been exploratory. The work, therefore, must be iterative
in nature. Since the results of this initial analysis have not
yielded  readily  useful  tools  for  managers,  the  major  emphasis  in  this 
report  is  on  the  approach  and  methods.  This  document,  therefore,  reports 
on  the  results  of  the  following  activities. 

.  Definition  of  cost  factors. 

Previous  work  was  reported  in  the  first  volume  of  this 
series. 

.  Collection  of  cost  data. 

A  questionnaire  was  designed  and  used  to  measure  the 
existence  of  presumed  cost  factors  in  a  number  of  program 
development  efforts. 

.  Formulation  of  a  prediction  model. 

A  linear  combination  of  cost  factors  with  appropriately 
assigned weights was hypothesized as a suitable cost-estimating
model.

.  Exploration  of  various  statistical  techniques  that  could  be 
used  to  develop  cost-estimating  relationships. 

The  techniques  explored  included  correlation  analysis, 
regression  analysis,  and  factor  analysis,  all  supplemented 
by  pertinent  judgment  and  intuition. 

.  Evaluation and documentation of the analysis and results.

Evaluation in any rigorous sense (e.g., cross-validation
with another data sample or actual use) was not possible.
However, while the resulting equations are not recommended
for actual use in development efforts, the authors would be
most anxious for readers to use these equations on an experimental
basis. Their reports of success or failure, and
reasons  for  deficiencies  in  the  equations  would  be  extremely 
valuable. 
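
The linear prediction model hypothesized in the list above can be stated compactly as follows. The notation is an assumption introduced here for illustration and is not taken from the report: \hat{Y} denotes an estimated cost (e.g., man months), X_1, ..., X_k the quantified cost factors, and b_0, ..., b_k the weights to be derived from the data.

```latex
\hat{Y} = b_0 + \sum_{i=1}^{k} b_i X_i
```

Each weight b_i expresses the change in estimated cost per unit change in factor X_i, with the other factors held fixed.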


Section II, Statement of the Problem, identifies and discusses the management
problem  of  computer  program  costing  in  the  context  of  cost  estimation  for 
automatic  data  processing  systems.  The  requirement  for  and  benefits  of 
accurate  cost  estimation  are  cited.  This  section  also  serves  to  define 
the  problem  addressed  by  the  study  as  that  of  deriving  an  initial  cost 
estimate  for  computer  program  development  and  does  not  address  the  problem 
of  costing  program  changes. 

Section  III,  Approach  and  Methods,  describes  the  exploratory  research 
that  constitutes  the  core  of  this  study.  The  technique  of  data  collection 
by  questionnaire,  as  used  in  this  analysis,  is  discussed,  with  emphasis  on 
some  of  the  problems  faced  by  the  investigators.  These  problems  center  on 
the  general  unreliability  and  unavailability  of  computer  programming  cost 
data. 

The  primary  analytical  technique  used  in  this  study  was  the  sequential 
application  of  linear  multivariate  regression  analysis,  supplemented 
heavily  by  pertinent  judgment  and  intuitive  analysis.  Other  techniques 
used  included  correlation  analysis  and  factor  analysis.  Statistical 
techniques  using  only  available  data  (survey  research)  often  suffer  from 
two  serious  problems;  both  are  encountered  in  this  study.  One  is  the 
lack  of  control  in  data  collection  which  results  in  less  than  optimum 
distribution  of  data  (e.g.,  gaps,  skewness);  and  the  second  is  simply 
an  insufficient  number  of  representative  observations.  Experience  with 
the techniques described in this section indicated that they are sufficiently
robust to supply useful cost-estimating relationships. If more
data  are  collected,  the  validity  and  confidence  one  may  place  in  the 
resulting  equations  will  be  increased. 
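
As a minimal sketch of the kind of linear multivariate regression described above, the following fits weights by ordinary least squares and reports the standard error of estimate. The function names, the two predictors, and the six observations are hypothetical and serve only to illustrate the technique; they are not the study's variables or data.

```python
import math

def fit_linear(rows, y):
    """Least-squares weights for y ~ b0 + b1*x1 + ... via the normal equations."""
    X = [[1.0] + list(r) for r in rows]          # prepend the intercept column
    k = len(X[0])
    # Augmented normal-equation matrix [X'X | X'y].
    A = [[sum(xr[i] * xr[j] for xr in X) for j in range(k)]
         + [sum(xr[i] * yi for xr, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def standard_error(rows, y, b):
    """Standard error of estimate: sqrt(SSE / (n - number of weights))."""
    resid = [yi - (b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)))
             for r, yi in zip(rows, y)]
    dof = len(y) - len(b)
    return math.sqrt(sum(e * e for e in resid) / dof)

# Hypothetical sample: (thousands of instructions, number of subprograms)
rows = [(5, 2), (12, 4), (20, 3), (33, 6), (41, 5), (58, 8)]
man_months = [14, 30, 55, 80, 110, 150]          # hypothetical costs

weights = fit_linear(rows, man_months)
print("weights:", [round(w, 3) for w in weights])
print("standard error of estimate:",
      round(standard_error(rows, man_months, weights), 3))
```

With so few observations relative to the number of weights, the degrees of freedom in the standard error are small, which is the sample-size problem the text describes.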

The  resulting  estimating  equations  are  described  in  Section  IV,  Summary  of 
Results.  Illustrative  formulas  are  shown  for  such  costs  as  man  months  and 
computer  hours,  and  product  characteristics  such  as  number  of  delivered 
program instructions. It is emphasized that the formulas are not sufficiently
valid, have large standard errors of estimate, and are primarily
illustrative  of  what  could  be  done  with  greater  quantities  of  data.  The 
effects  of  removing  three  extreme  data  points  are  treated  in  a  separate 
analysis  which  suggests  that  different  populations  may  be  necessary  to 
describe  computer  programming  development. 

The final section, Recommendations for Future Work, outlines several
additional  techniques  that  may  prove  valuable  in  further  analysis  and 
recommends  the  extensive  collection  of  data,  particularly  outside  of  SDC. 
This  is  necessary  to  increase  the  sample  size  and  to  eliminate  the  potential 
bias  introduced  by  examining  the  data  of  only  one  organization.  Other 
research  highly  pertinent  to  the  problem  of  cost  estimation  is  also 
recommended.  This  includes  work  on  techniques  for  estimating  program 
size  and  the  formulation  of  descriptors  and  measures  of  program  performance 
and  quality. 

To  keep  the  main  body  of  the  report  brief,  numerical  and  computational 
details  have  been  placed  in  the  appendices.  These  include  a  copy  of  the 
data  collection  questionnaire,  identification  of  all  the  variables  examined 
in  the  study,  the  responses  to  the  questionnaires,  the  correlation  of  each 
variable  with  cost  (validity  table),  the  results  of  a  preliminary  factor 
analysis,  and  a  summary  of  the  regression  analysis  for  each  derived 
equation. The appendices contain sufficient information to allow independent
investigators to repeat any part of the study or to continue it
in  other  desirable  directions.  In  fact,  the  compilation  of  cost  data 
included  in  this  report  is  felt  to  be  the  first  and  most  comprehensive 
collection  of  its  kind,  and  therefore,  a  valuable  resource. 
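
A validity table of the kind mentioned above could be rebuilt from the appendix data along the following lines. This is only a sketch: the variable names and values are hypothetical, and the Pearson product-moment coefficient is assumed as the measure of correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def validity_table(variables, cost):
    """Return (name, r) pairs, strongest absolute correlation first."""
    rows = [(name, pearson(values, cost)) for name, values in variables.items()]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)

# Hypothetical questionnaire variables for four programs.
variables = {
    "instructions_thousands": [1, 2, 3, 4],
    "pages_of_documentation": [1, 3, 2, 4],
}
man_months = [2, 4, 6, 8]                        # hypothetical cost measure

for name, r in validity_table(variables, man_months):
    print(f"{name:25s} r = {r:+.2f}")
```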


II.  STATEMENT  OF  THE  PROBLEM 


Background 

The  number  and  size  of  information-handling  systems  that  use  automatic 
data  processing  (ADP)  have  continued  to  grow  in  order  to  support  management 
and  operations  in  both  government  and  industry.  Despite  the  rapid  growth 
in  applications  of  ADP,  the  development  of  a  reliable  technology  with  a 
set  of  principles  and  techniques  for  managing  programming  efforts  is 
noticeably  absent.  Little  effort  has  been  devoted  to  the  collection  of 
data  on  past  experience  and  to  the  organization  of  these  data  into  a 
systematic  body  of  knowledge  for  managers.  Ad  hoc  groups  organized  to 
examine  ADP  applications  in  military  command  and  control  operations 
(e.g.,  the  Air  Force  Winter  Study  Group  in  1959  and  the  Institute  of 
Naval  Studies  Summer  Study  in  1961)  have  found  that  computer  program 
development,  in  comparison  to  equipment  development,  is  a  lagging 
technology. 

One  particularly  acute  problem  in  computer  program  development  is  that  of 
cost  estimation.  Recent  Congressional  hearings  concerning  the  federal  use 
of  electronic  data  processing  equipment  stress  the  need  for  "more  specific 
and  systematic  measures  of  cost."  General  Terhune  of  the  Air  Force  Electronic 
Systems  Division,  in  an  address  to  the  American  Federation  of  Information 
Processing  Societies  (Las  Vegas,  1963)  stated  that  "there  is  no  reliable 
way to estimate time and costs of initial program jobs." In many cases, cost
estimation  has  been  overly  optimistic;  in  others,  it  has  been  neglected  in 
planning;  as  a  result,  buyers  have  often  been  surprised  at  the  real  cost  of 
program  development.  In  addition  to  the  problem  of  initial  costing,  there 
is  a  need  to  develop  techniques  for  costing  changes  in  the  programming 
project.  Although  there  have  been  efforts  to  improve  the  prediction  of 
equipment  costs  and  lead  times,  similar  work  has  not  kept  pace  in  the 
programming  community. 

Another  important  and  related  problem  is  the  lack  of  measures  of  program 
performance  and  quality.  When  one  purchases  a  hardware  component,  some 
statements  (usually  quantitative)  concerning  its  performance  and  quality 
can  be  made.  As  a  result,  both  producer  and  buyer  have  a  means  toward  a 
common  understanding  of  the  relationship  between  price,  performance  and 
quality.  No  such  means  toward  a  similar  understanding  exists  for  the 
relationship  of  price,  performance  and  quality  in  computer  programming. 

The  disillusionment  of  buyers  and  users,  the  problems  faced  by  programming 
managers,  and  the  need  to  establish  more  accurate  and  meaningful  cost/value 
relationships  for  programs  and  their  development  have  led  to  the  present 
need  for  research  into  computer  programming  management. 

In  answer  to  this  need,  a  formal  research  project  to  investigate  problems  in 
programming  management  was  initiated  at  the  System  Development  Corporation 
in  1962  by  the  Advanced  Research  Projects  Agency  (ARPA).  This  project,  the 
Computer  Program  Implementation  Process  (CPIP)  project,  at  first  sought  to 
determine  whether  enough  similarity  existed  among  various  program  development 
efforts  to  permit  analysis  of  the  process  of  programming  in  a  systematic  way. 
In  the  early  project  work,  significant  similarity  in  programming  was  found 
in  terms  of  the  activities  that  constitute  the  programming  process  and  the 
problems  that  are  commonly  encountered.  To  reach  this  conclusion,  project 
members  collected  some  data  on  implementation  experience,  both  qualitative 
and  quantitative,  in  a  survey  of  development  efforts  by  SDC  and  other 
organizations.  The  quantitative  data  consisted  of  measures  of  product 
size, such as number of pages of documentation, number of program instructions,
and costs measured in man months and computer hours.

After  identifying  a  broad  range  of  problem  areas,  CPIP  project  members  began 
a  more  detailed  investigation  of  the  factors  that  contribute  to  programming 
costs. In March 1964, ESD contracted with SDC for an extension of the cost
analysis  to  include  detailed  cost  data  and  appropriate  statistical  analysis. 
This  report  is  the  result  of  that  work. 

The  Cost-Estimating  Problem 

This  study  was  undertaken  to  explore  various  techniques  to  derive  estimating 
relationships for the costs of computer programs. By costs, we mean the
resources  that  are  required  to  produce  a  program,  primarily  man  months  of 
programmer  time  and  computer  hours.  To  insure  that  results  would  be  useful 
to a large number of managers in different organizations, we did not use
dollars  as  a  cost  measure  because  they  tend  to  be  influenced  by  differential 
wage  rates,  computer  costs  and  overhead  charges. 

Good  cost  estimation  is  necessary  in  successful  computer  program  management 
for  many  reasons,  including  the  following: 

a.  Cost  estimates  serve  as  the  basis  for  budget-planning  decisions.  The 
computer  program  may  be  a  significant  element  in  the  total  cost  of  a 
command  and  control  system.  In  the  framework  of  cost  effectiveness,  a 
decision  to  select  one  system  over  another  may  be  strongly  influenced 
by  the  cost  of  programming.  Better  cost-estimating  techniques  would 
reduce  the  uncertainty  in  making  such  decisions. 

b.  Cost  estimates  are  used  for  resource  allocation  and  control.  Cost 
estimates  serve  as  a  guideline  (in  some  cases,  an  upper  limit)  for 
the  resources  that  are  allocated  to  the  work.  Within  these  limits, 
resources  are  apportioned  according  to  the  estimates  for  various  parts 
of  the  project.  While  the  programming  project  is  in  process,  the 
estimates  aid  in  controlling  resource  expenditure  and  reallocation. 
Thus,  accurate  estimates  will  improve  both  allocation  and  control. 

c.  Cost  estimates  are  used  for  evaluation.  Equally  important  to  the  direct 
uses  of  improved  cost  predictors  are  the  indirect  uses.  For  example, 
predictors  can  be  sought  that  relate  requirements  and  resources  to  the 
methods  used  to  control  costs.  With  such  predictors,  one  can  compare 
alternative  methods  and  staffing  policies  and  select  tools,  techniques, 
and  procedures  that  will  tend  to  reduce  costs. 

Granted  that  cost  estimation  is  important  in  programming,  how  is  this 
activity  now  being  performed?  At  the  start  of  a  project,  when  the  user 
and  the  program  developer  have  agreed  upon  the  gross  system  requirements, 
the  developer  estimates  the  amount  of  work  to  be  done  based  upon  (a)  the 
programs  and  procedures  that  have  to  be  designed,  implemented,  tested  and 
documented;  (b)  the  analysis  and  experiments  that  may  have  to  be  conducted; 
and  (c)  new  utility  programs  that  may  be  needed.  If  possible,  comparisons 
of  the  new  system  with  existing  systems  are  made  in  the  hope  of  finding  a 
cost-estimating guideline. A first estimate is made for the resources (men,
machines,  facilities  and  travel)  required  to  do  the  work  in  the  scheduled 
time.  When  these  estimates  are  matched  against  their  availability,  the 
schedule  may  be  adjusted  accordingly.  In  addition,  alternate  proposals 
may  be  generated  to  reflect  trade-offs  between  scheduled  time,  system 
requirements  and  costs.  For  a  more  detailed  cost  analysis,  some  prototype 
tasks may be completed and costed to determine the expected level of complexity
and nature of problems.

An alternative and probably more frequent approach to costing is to estimate
the  number  of  program  instructions  using  experience  with  similar  programs  as 
a  basis.  The  number  of  instructions  provides  an  intermediate  parameter  that 
is  then  converted  to  man  months  and  computer  hours  by  various  rules  of  thumb. 
Man  months  and  computer  hours  are  then  converted  to  dollars  by  multiplying 
by  average  expected  rates.  Finally,  funds  for  supporting  equipment,  supplies, 
office  facilities,  travel,  overhead  and  general  administration  are  added  to 
produce  the  total  cost. 
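
The rule-of-thumb chain just described can be sketched as a short calculation. Every rate below is a hypothetical placeholder chosen for round arithmetic, not a figure from this study, and the function and parameter names are likewise illustrative.

```python
def rule_of_thumb_cost(instructions,
                       instr_per_man_month,
                       computer_hours_per_1000_instr,
                       dollars_per_man_month,
                       dollars_per_computer_hour,
                       support_dollars=0.0):
    """Convert an instruction-count estimate to resources, then to dollars."""
    # Step 1: instructions -> man months and computer hours (rules of thumb).
    man_months = instructions / instr_per_man_month
    computer_hours = instructions / 1000.0 * computer_hours_per_1000_instr
    # Step 2: resources -> dollars at average expected rates.
    labor = man_months * dollars_per_man_month
    machine = computer_hours * dollars_per_computer_hour
    # Step 3: add supporting equipment, supplies, travel, overhead, etc.
    return {
        "man_months": man_months,
        "computer_hours": computer_hours,
        "total_dollars": labor + machine + support_dollars,
    }

estimate = rule_of_thumb_cost(
    instructions=20_000,               # estimated from similar programs
    instr_per_man_month=200,           # hypothetical productivity rate
    computer_hours_per_1000_instr=10,  # hypothetical machine usage rate
    dollars_per_man_month=2_500,       # hypothetical wage rate
    dollars_per_computer_hour=300,     # hypothetical machine rate
    support_dollars=50_000,            # hypothetical overhead and support
)
print(estimate)
```

Because each step multiplies the instruction estimate by a fixed rate, any error in that intermediate parameter carries proportionally into the man months, computer hours, and their dollar costs.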

The  current  techniques  for  cost-estimating  are  not  very  accurate.  Projects 
frequently  require  more  resources  than  were  originally  estimated,  even  with 
ample  safety  factors  introduced.  Some  reasons  for  this  lack  of  success  are 
the  following: 

a.  Lack  of  agreement  on  terminology.  The  few  "standards"  that  do  exist  do 
not  contain  a  commonly  accepted  set  of  terms  to  describe  the  programming 
process,  the  programming  products,  and  the  personnel  involved  with  these. 

b.  Poor  definition  of  product  quality.  The  lack  of  standard  measures  of 
product  performance  and  product  quality  hampers  comparison  of  costs 
among  the  various  program  systems.  For  example,  the  common  use  of  a 
cost  per  instruction  to  compare  programs  does  not  recognize  that 
radically  different  quantities  of  resources  may  be  needed  to  develop 
two  programs  each  of  the  same  length  should  they  differ  in  complexity, 
language  used,  programmer  experience  level,  and  the  degree  to  which 
they  were  clearly  specified  at  the  start. 

c.  Poor  quality  of  cost  data.  Present  cost  collection  methods  are  not 
geared  to  accumulate  data  by  product  and  by  function  to  be  performed. 
Therefore,  costs  that  are  collected  by  various  organizations  are 
difficult  to  compare. 

d.  Nonquantitative nature of many factors that contribute to cost. Programming
costs are strongly influenced by many factors that are presently
difficult  to  quantify  such  as  the  proficiency  of  the  programming  staff 
and  the  quality  of  management. 

Despite  these  difficulties,  this  project  was  undertaken  as  a  first  step,  in 
the  hope  that  estimating  the  costs  of  programming  products  could  be  made  a 
more  systematic  and  reliable  process. 

Scope  of  this  Project 


The  basic  problem  in  cost  estimation  is:  given  the  requirements  for  a 
computer  program,  what  types  and  quantities  of  resources  are  needed  to 
develop  such  a  program?  A  further  question  is  concerned  with  how  these 
resources  should  be  (or  more  realistically,  could  be)  applied  over  time, 
i.e.,  the  scheduling  problem.  One  difficulty  in  cost  estimation  is  that 
the relationship between requirements and cost is not known. That is, the
present ways of stating requirements cannot be readily interpreted in terms
of the work to be done. Further, the resources, particularly the programmers,
cannot  be  characterized  in  such  a  way  as  to  predict  the  work  that  they  can 
do.  A  solution  of  the  problem  in  the  long  run  must  involve  finding  ways  to 
characterize  work  to  be  done,  requirements,  and  resources,  so  that  one  can 
be  translated  into  the  others. 

In  this  study  we  have  tried  to  relate  requirements,  resources,  and  certain 
indicators of management practice to costs, using experience data and statistical
techniques. To introduce some semblance of rigor, we defined a population
(a)  by  limiting  the  scope  of  programming  activities  to  program  design,  code 
and  test  and  (b)  by  considering  for  purposes  of  comparison  the  concept  of  a 
"data  point"  defined  as  the  smallest  set  of  instructions 

.  whose  purpose  is  defined  by  someone  other  than  the  programmer, 

.  which  is  deliverable  to  the  user  (customer)  as  a  package,  and 

.  which  is  loaded  into  the  computer  as  a  unit  or  system  to  achieve 
the  stated  purpose. 

No  attempt  was  made  to  further  differentiate  programming  efforts.  The  many 
factors  of  programming  language,  programmer  experience,  and  complexity,  that 
may  be  used  to  explain  differences  in  cost  were  identified  as  independent 
variables  and  tested  for  their  significance  by  means  of  regression  analysis. 

We  addressed  the  problem  of  estimation  at  the  beginning  of  the  program 
development  effort  and  did  not,  therefore,  include  the  costing  of  program 
changes.  The  problem  of  costing  changes  is  one  worthy  of  a  thorough 
investigation.  When  changes  are  proposed,  some  experience  has  already  been 
accumulated  in  the  design  of  the  program  and  relationships  have  been  formed 
with  the  user,  whereas  at  the  beginning  of  the  project  the  estimator  has 
far  less  information.  In  estimating  the  cost  of  changes,  one  is  concerned 
with  both  the  additional  instructions  and  documentation  that  have  to  be 
prepared and the "scrap" instructions that must be discarded. Also, in costing
changes, one must consider the effects of the change upon the entire program system.
Therefore,  the  costing  of  changes  will  usually  require  more  accuracy  and 
probably  include  more  detail  of  additional  factors. 

To  conduct  this  analysis,  we  gathered  data  by  questionnaire  for  twenty-seven 
completed  programs.  With  these  data,  we  developed  analytical  procedures 
using  tested  statistical  techniques.  Realizing  that  we  were  engaged  in  a 
search process and that our sample size was too small to achieve high-confidence
predictors, we aimed for two results:

a.  A  new  questionnaire  with  improved  ideas  on  data  to  be  collected 

b.  Demonstration that the approach and methods used could lead to useful
results 

Our  experience  with  respect  to  these  objectives  is  discussed  next. 


III.  APPROACH  AND  METHODS 


Introduction 

In  this  study,  we  used  multivariate  regression  analysis  as  the  basic  analytical 
tool.  The  mathematical  basis  and  theory  of  regression  and  correlation  analysis 
will  not  be  discussed  in  this  report.  Several  good  texts  are  listed  among  the 
references (1), (2) and (3). Regression analysis techniques have been used
quite  successfully  to  derive  estimating  relationships  for  determining  the 
reliability  of  electronic  equipment  (4),  the  cost  of  overhauling  ships  (5), 
and the initial cost of tooling for aircraft production (6). To our knowledge,
this study is the first application of such techniques to the problem
of  deriving  cost-estimating  relationships  for  the  development  of  computer 
programs. 

Since  our  statistical  analysis  and  the  associated  work  were  exploratory,  we 
felt  it  important  to  discuss  the  methods  and  techniques  used  and  to  review 
their  relative  success  in  some  detail.  We  have  included  the  problems  and 
procedures  of  both  the  data  collection  and  statistical  analysis.  This 
section  is,  therefore,  concerned  with  methods  only.  Data  and  results  of 
the  analysis  are  discussed  in  the  next  section  of  the  report. 

Data  Collection 


Design  of  the  Questionnaire.  The  organization  of  the  questionnaire  (see 
Appendix  I)  paralleled  the  organization  of  the  cost  factors  discussed  in 
Volume  I  of  this  study.  Each  of  the  six  categories  of  factors  comprised  a 
section  in  the  questionnaire.  This  organization  permitted  easy  separation 
of  the  questionnaire,  so  that  each  section  could  be  easily  delegated  to  the 
people  most  qualified  to  complete  it.  The  six  parts  of  the  questionnaire  were: 

1.  Operational  Requirements  and  Design 

2.  Program  Design  and  Production 

3.  Data  Processing  Equipment 

4.  Programming  Personnel 

5.  Management  Procedures 

6.  Development  Environment 

The first two parts address the question, "What was the job to be done?"
The next two ask, "What were the available resources?" and the last two
ask, "What was the nature of the working environment?"

In  the  first  volume  of  this  series,  we  identified,  organized,  and  discussed 
about  50  factors  that  were  advanced  as  having  an  influence  on  the  cost  of 
computer program development. The first task in this project was to reformulate
the factors so that they could be quantified in the program development
efforts  that  were  studied.  This  work  was  reflected  in  the  questionnaire,  the 
"instrument"  used  for  data  gathering,  i.e.,  the  presumed  cost  factors  became 
items  in  the  questionnaire,  and  later,  variables  in  a  statistical  analysis. 

The  skillful  design  of  the  data  questionnaire  is  a  vital  task  in  research 
of  this  kind,  for  it  is  on  the  basis  of  information  obtained  from  this 
instrument  that  the  validity  of  the  approach  rests.  To  construct  sound 
questionnaires,  certain  basic  principles  of  design  must  be  adhered  to. 

Three  of  the  most  useful  principles  are  reliability,  validity,  and  face 
validity.  In  designing  the  original  questionnaire,  reliability  and 
validity  were  somewhat  neglected,  whereas  face  validity  was  emphasized. 

Questionnaire reliability* can be viewed as the consistency with which a
given  pattern  of  responses  is  obtained  from  replication  of  the  survey  to 
identical  or  alternate  respondents.  We  realized  that  poorly  structured 
items  present  opportunities  for  ambiguous  interpretations  and  inconsistent 
responses,  and  hence,  lower  the  overall  reliability  of  the  instrument. 
Although,  in  this  iteration,  we  treated  each  item  as  an  independent 
variable  in  the  analysis,  we  plan,  in  the  next  iteration,  to  explore 
aggregation  techniques  for  grouping  similar  items  into  indices.  This 
will  tend  to  improve  the  reliability  of  questionnaire  variables  and 
preserve  sources  of  cost  variance  that  might  otherwise  be  ignored. 

The  second  principle,  validity,  concerns  the  extent  to  which  the  variables 
will  predict  costs.  Statistical  techniques  are  available,  under  appropriate 
circumstances,  for  testing,  selecting  and  grouping  items  that  will  enhance 
overall  questionnaire  validity.  Validity  and  reliability  are  interrelated 
in  that  the  reliability  of  the  questionnaire  sets  a  limit  on  the  validity 
it  may  achieve.  Thus,  increasing  the  reliability  of  the  instrument  will 
tend  to  increase  its  overall  validity,  provided  the  items  remaining  in  the 
questionnaire  retain  their  individual  validities. 
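
The limit that reliability places on validity can be made explicit with a standard result from classical test theory, which this report does not itself state. The symbols are assumptions introduced for illustration: r_{xy} is the validity of a questionnaire variable x as a predictor of cost y, and r_{xx} and r_{yy} are the reliabilities of the two measures.

```latex
r_{xy} \le \sqrt{r_{xx}\, r_{yy}} \le \sqrt{r_{xx}}
```

For example, a variable with a reliability of 0.64 could show a correlation with cost of at most 0.80, however relevant the underlying factor may be.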


*Reliability, as used here, concerns the phenomena of errors or differences
in measurement obtained when a characteristic of a given object is measured
several times by instruments. This useful concept has been widely employed
in psychological and educational fields. In the physical sciences, reliability
of measurement is usually subsumed under the alternate topic,
errors of observation.

The third principle, face validity, is essentially the meaningful quality
that  the  questionnaire  imparts  to  its  respondents.  Questionnaires  having 
items  with  good  face  validity  generally  tend  to  create  a  favorable  attitude 
among  respondents,  thereby  increasing  the  likelihood  of  reliable  responses, 
and  consequently,  allowing  the  inherent  validity  of  the  instrument  to  be 
achieved. 

Of  the  characteristics  mentioned  above,  the  one  on  which  the  most  emphasis 
was  placed  was  the  principle  of  face  validity,  i.e.,  meaningfulness  and 
answerability  of  questions.  To  insure  some  degree  of  consistency  in  the 
understanding  and  answering  of  the  questions,  it  was  necessary  to  define 
terms  within  the  body  of  the  questionnaire.  For  example,  such  words  as 
data  base,  instruction,  parameter  test,  innovation,  and  many  others  do  not 
enjoy  a  desirable  degree  of  standardization  and  were,  therefore,  defined 
when  used  in  a  question.  This  technique  was  fairly  successful  and  should 
be  used  even  more  extensively  in  the  future. 

In  addition  to  face  validity,  we  considered  the  accuracy  of  the  data.  To 
determine the accuracy of the responses to 44 key items of 93 in the questionnaire,
we asked responders to assess the accuracy of their own answers
to  these  items.  They  coded  their  assessment  according  to  the  following 
three  categories: 


Data Accuracy Index

Record               Memory                     Judgment

1  Very accurate     4  Accurate recollection   7  Confident
2  Good estimate     5  Good guess              8  Good guess
3  Unreliable        6  Very hazy               9  Estimate

Appendix IV contains a frequency count of the estimated accuracy of the
responses to each of the 44 questions. These responses were not used in
any explicit way. If the resulting regression equations had displayed
smaller confidence limits, a closer examination of the accuracy of the input
data would have been made to ensure more fully our confidence in the
results.
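
The nine-point coding above combines two pieces of information: the source of an answer (record, memory, or judgment) and the responder's confidence in it. The following is a minimal Python sketch of how such codes might be decoded and tallied. The function name and the sample codes are illustrative assumptions; they are not taken from Appendix IV.

```python
from collections import Counter

# Data Accuracy Index as given in the text: code -> (source, assessment).
ACCURACY_INDEX = {
    1: ("Record", "Very accurate"),
    2: ("Record", "Good estimate"),
    3: ("Record", "Unreliable"),
    4: ("Memory", "Accurate recollection"),
    5: ("Memory", "Good guess"),
    6: ("Memory", "Very hazy"),
    7: ("Judgment", "Confident"),
    8: ("Judgment", "Good guess"),
    9: ("Judgment", "Estimate"),
}

def tally_by_source(codes):
    """Count how many responses fall in each source category."""
    return Counter(ACCURACY_INDEX[code][0] for code in codes)

# Hypothetical accuracy codes for one questionnaire item (not survey data).
example_codes = [1, 1, 2, 4, 5, 5, 7, 9]
counts = tally_by_source(example_codes)
print(counts)
```

A tally of this kind is the sort of frequency count that could be used to weight or screen items if the accuracy data were put to explicit use.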


Design  of  the  Sample.  In  the  classical  sense,  there  was  no  rigorous  design 
of  the  sample.  To  expedite  the  analysis  for  this  first  iteration,  only  data 
within the System Development Corporation were collected. The types of
programs  for  which  data  were  collected,  however,  represented  a  fairly 
broad  range:  responses  were  received  for  operational  programs,  utility 
programs,  and  support  programs,  all  within  the  category  of  command  and 
control  systems. 

As pointed out earlier, two definitions were used to bound the data sample.
First, the same set of programming activities comprised the program development
effort in each observed case. We defined the scope of the programming
job to begin with the program design activity and to end with program test
(not including system test). Some questions were asked, however, about
participation in the operational design activity. These activities of the
programming process used as a base are described in Reference (7).

Second, a program unit, i.e., a "data point," a member of the data sample,
was  defined  to  be:  the  smallest  set  of  instructions  (a)  whose  purpose  is 
defined  by  someone  other  than  the  programmer,  (b)  which  is  delivered  to  the 
user  or  customer  as  a  package,  and  (c)  which  is  loaded  into  the  computer  as 
a  program  unit  or  system  to  achieve  the  stated  purpose  or  objective.  By 
this  definition,  a  program  data  point  can  be  an  operational  program,  a 
utility  program,  or  even  an  experimental  or  prototype  program.  The  user 
of  the  program  may  be  the  buyer,  or  he  may  be  another  programmer,  as  in 
the  case  of  a  utility  program. 

Ideally,  the  number  of  data  points  for  analysis  should  equal  or  exceed  the 
total number of variables being considered for inclusion in the cost prediction
equations. In addition, the points should range uniformly across the
cost  domain  in  which  we  wish  to  make  estimates,  i.e.,  the  mathematical  surfaces 
fitted  by  regression  analysis  should  be  securely  anchored  in  the  solution  space 
and  not  subject  to  excessive  translation  or  rotation  when  cross-validated  to 
new  data  samples.  A  basic  problem  in  this  study  was  the  small  sample  size. 

An  excessive  imbalance  between  number  of  data  points  and  number  of  variables 
led  to  lack  of  complete  confidence  in  rejecting  potential  predictor  variables 
and  contributed  to  the  somewhat  large  confidence  limits  that  characterize 
the  prediction  equations  derived.  Two  associated  problems  created  by  survey 
limitations  were  (a)  a  poor  distribution  of  program  sizes  measured  in  machine 
language  instructions  (i.e.,  many  small  programs,  few  large  programs),  and 
(b)  a  probable  organizational  bias  in  examining  the  experience  of  only  one 
company.  These  problems  will  be  discussed  in  more  detail  later  in  the 
report. 
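
The danger of having too few data points relative to variables can be illustrated numerically. The sketch below is an illustration under stated assumptions, not part of the study: it fits pure random noise with NumPy least squares, and every name in it is hypothetical. With 27 points, a constant plus 26 noise predictors reproduces the "costs" exactly even though no real relationship exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 27                      # sample size reported in this study

def r_squared(n_predictors):
    """Fit pure-noise predictors to a pure-noise cost and return R^2."""
    X = rng.normal(size=(n_points, n_predictors))
    y = rng.normal(size=n_points)
    design = np.column_stack([np.ones(n_points), X])   # add constant term
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

r2_few = r_squared(4)      # few predictors: noise explains only part of y
r2_many = r_squared(26)    # 26 predictors + constant = 27 unknowns
print(round(r2_few, 3), round(r2_many, 3))
```

The perfect fit in the second case carries no predictive value, which is why the number of candidate variables had to be reduced before regression could be trusted.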

Administration  of  the  Questionnaire.  The  questionnaire,  instructions  for  its 
completion,  and  background  information  on  the  objectives  of  the  project  were 
sent  to  the  managers  of  the  three  Divisions  responsible  for  developing 
computer programs within the Corporation.¹ We suggested the major contract
areas within these Divisions where we felt there would be a number of meaningful
data points. We further suggested that the subordinate managers
responsible for the development of these programs determine how best to
partition their program systems in accordance with our definition of a data
point. Each questionnaire (i.e., data point) was then further delegated to
the people most qualified to provide the required information. As mentioned
above, the questionnaire was designed to be easily divided and delegated.

¹ Air Defense Division, Washington Division, and Command Control Division.

We  held  short  meetings  with  the  recipients  of  the  questionnaires  to  explain 
the intent of the questionnaire and to answer questions regarding the information
requested. Efforts on the part of the responders in completing the
questionnaires  ranged  from  one  to  five  man  days.  After  receipt  of  the 
completed  questionnaires,  we  effected  follow-up  communications  where 
necessary  to  request  explanations  of  answers  that  were  unclear,  ambiguous 
or  nonresponsive. 

Statistical  Analysis 


General  Approach.  The  basic  statistical  technique  used  was  multivariate 
regression  analysis.  Mathematically,  this  procedure  involves  the  derivation 
of  the  equation  of  a  surface  that  fits  as  closely  as  possible  the  observed 
data points (see References 1, 2, and 3). In using statistical techniques
to solve a heretofore completely unstructured problem, we were faced with
three major problems: (a) the recognized unreliability of the data, (b) the
relative scarcity and poor distribution of data points in the sample, and
(c) the unfavorable ratio of data points (sample size) to variables, i.e.,
many more variables than available data points. Despite these problems,
the statistical techniques employed were sufficiently robust¹ to produce
meaningful  results. 

During  the  time  allotted  for  this  study,  little  could  be  done  to  solve  the 
first  two  problems.  The  basic  methods  of  regression  analysis  and  factor 
analysis  were  supplemented  by  correlation  analysis  and  intuitive  analysis 
in  order  to  deal  with  the  problem  of  imbalance  between  data  points  and 
variables.  Initially,  the  analysis  of  cost  factors  in  computer  program 
development  led  to  the  identification  of  93  variables  (i.e.,  questionnaire 
items)  that  were  believed  to  be  associated  with  costs.  Generally  speaking, 
the  number  of  data  points  should  have  exceeded  the  number  of  such  variables 
to  obtain  a  trustworthy  analysis.  Thus,  in  this  problem,  we  would  have 
preferred  several  hundred  data  points  to  use  as  a  basis  for  selecting  the 
best  variables  and  determining  their  proportionate  relevance  in  cost 
estimation. As it turned out, a major analytical problem concerned the
reduction of the total number of potential predictor variables to a lesser
number of representative variables while proceeding to the initial development
of prediction equations. Even with the unfavorable data point-to-variable
ratio, it was possible to apply statistical techniques as an aid in selecting
desirable variables. However, to compensate for the inherent instability of
statistical procedures based on small and poorly distributed samples, we
relied heavily upon the program system development knowledge and experience
available to us.

¹ A robust technique is considered here to be one that is relatively insensitive
to departures from the assumptions and conditions on which it has been
theoretically based.

Other  approaches,  consistent  with  the  fundamental  goals  of  multivariate 
regression  analysis,  were  used  to  select  variables  for  further  analysis. 

In  general  terms,  the  criteria  for  selection  of  the  "best"  variables  were 
as  follows: 

1.  Validity — the  extent  to  which  each  predictor  variable  individually 
accounted for cost variance. Initially, the correlation coefficients
of all variables with respect to major costs were examined. As analysis
progressed, standardized regression coefficients on specific costs were
used to refine the selection.

2.  Independence--the extent to which each predictor variable was free of
relationship to other predictor variables. This was observed by examining
the intercorrelations among predictor variables.

3.  Confidence — the  extent  to  which  each  predictor  variable,  when  included 
in a multivariate prediction equation, would tend to increase the confidence
that can be placed in the predicted cost parameter. The available
theory  provided  useful  indices  such  as  standard  errors  of  estimate  and 
confidence  limits.  Confidence  estimation  was  the  key  aspect  of  the 
current  analysis.  This  important  topic  is  discussed  more  fully  in  the 
section  describing  the  prediction  model. 

4.  Distribution  Quality — the  extent  to  which  each  predictor  variable  tended 
to  be  distributed  without  large  gaps  and  without  severe  skewness  to 
either  high  or  low  values.  On  occasion,  transformations  of  variables 
by  logarithms  were  employed. 

5.  Missing  Data — the  extent  to  which  each  predictor  variable  was  free  of 
missing  or  approximated  data.  A  working  principle  suggested  that 
variables  that  were  so  difficult  to  assess  as  to  have  frequent  missing 
values  were  probably  poor  variables  for  practical  prediction  purposes. 

6.  Intuitive  Considerations — general  opinions  and  experience  concerning 
the  usefulness  of  a  variable  for  prediction  purposes. 

While  intuitive  considerations  pervaded  the  entire  variable  selection 
procedure,  considerations  of  validity,  independence,  and  confidence  were 
weighted  most  heavily  in  the  regression  analyses,  and  considerations  of 
distribution  characteristics  and  missing  data  were  primarily  confined  to 
the initial analysis of raw data. However, a few appealing variables having
inferior  distribution  characteristics  or  approximated  data  were  included  in 
some  of  the  regression  analyses  but  were  given  a  low  selection  priority. 

Prediction  Model.  The  statistical  analyses  for  this  study  were  based  on  the 
foundations  of  multivariate  regression  analysis.  Although  this  statistical 
approach  attempts  to  provide  a  model  in  which  the  cost  of  computer  programs 
is  related  structurally  to  independent  factors,  the  major  emphasis  in  the 
model used here is not on the statistical rigor with which the prediction
equations are derived but on the practical accuracy and usefulness of the
equations in the actual task of estimating computer programming costs.

Simple predictive efficiency, although acknowledged, is not emphasized.
Instead, the goal is to provide a tool that is sufficiently valid to be
useful  outside  of  the  particular  data  pattern  on  which  the  empirical 
analysis  is  based. 

Statistical  tests  available  for  evaluating  the  estimating  efficiency  of 
equations  from  their  sample  data  are  important  but  insufficient  indicators 
of  the  quality  of  a  model  of  this  type.  Experience  has  frequently  revealed 
that  equations,  although  satisfying  rigorous  estimation  criteria  in  the 
sample  from  which  they  were  derived,  still  perform  rather  poorly  when 
applied  to  new  data.  The  ultimate  value  of  a  prediction  equation  lies  in 
the  extent  to  which  it  can  make  useful  predictions  outside  of  the  data  sample 
on  which  it  was  based.  This,  of  course,  places  a  great  responsibility  on  the 
research  program  in  acquiring  data  sufficient  in  quantity,  representativeness 
and  practicality  to  warrant  application  to  the  domain  in  which  predictions 
are  to  be  made. 


Initially, it was assumed that an enduring linear relationship exists between
costs ($Y_k$) and various suitably weighted subsets of predictor variables
$X_1, X_2, \ldots, X_m$. Mathematically, the basic task of analysis involved the
fitting, by least-squares procedures, of hyperplanes (i.e., flat surfaces in three or
more dimensions) to a sample of data points arrayed in $m + 1$ orthogonal
dimensions. This model may be compactly expressed as follows:

$$Y_k = A_k + \sum_{i=1}^{m} B_i X_i + E_k \tag{1}$$

where:  $Y_k$ is the value of the kth cost dimension to be estimated.

        $A_k$ is a constant that may be either positive or negative in all
        estimates for a particular $Y_k$.

        $B_i$ is the weight to be assigned to the ith predictor variable to
        optimize the overall accuracy of the equation.

        $X_i$ is the numerical value for the ith predictor variable.

        $m$ is the number of predictor variables used in the prediction of $Y_k$.

        $E_k$ is the portion of $Y_k$ that cannot be estimated by any weighting of
        the $X_i$. This is known as the error term. It may be positive or
        negative and will vary randomly from data point to data point.


Although  the  linear  prediction  model  can  be  extended  to  the  quadratic  case, 
it  was  not  used  in  this  study  due  to  the  relative  scarcity  of  data  points. 
However,  future  analysis  may  suggest  this  type  of  modification. 
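
As an illustration of how the linear model of equation (1) is fitted, the following minimal sketch uses NumPy's least-squares solver on synthetic data. It is an assumption-laden example rather than the program used in the study: the function name `fit_linear_model`, the data, and the "true" weights are all invented for the demonstration. It also computes the standard error of estimate, which is defined later in this section.

```python
import numpy as np

def fit_linear_model(X, y):
    """Least-squares estimates of the constant A and weights B in
    Y = A + sum_i B_i * X_i + E, plus the standard error of estimate."""
    n, m = X.shape
    design = np.column_stack([np.ones(n), X])       # column of ones for A
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef                   # the E term per data point
    # Root mean square error adjusted for the m weights and the constant.
    sigma_e = np.sqrt(residuals @ residuals / (n - m - 1))
    return coef[0], coef[1:], sigma_e

# Synthetic illustration only: 26 points, 2 predictors, known weights.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(26, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=26)
A, B, sigma_e = fit_linear_model(X, y)
print(A, B, sigma_e)
```

With well-behaved data the recovered weights lie close to the values used to generate the sample; the study's difficulty was that its real data were neither plentiful nor well distributed.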


In multivariate estimation, the critical element in the equation is the
$E_k$ term, because this defines the statistical confidence that one may place in
the equation. At one extreme, the $E_k$ term may be zero for all observations,
which would define a perfect estimating equation. At the other extreme, the
contribution of the $X_i$ would be zero, and the equation would be worthless.
In this case, the distribution of $E_k$ would be approximated by the standard
deviation of the $Y_k$ values from the arithmetic mean of $Y_k$. The mean would,
in all such cases, be the only reasonable estimate for any $Y_k$ because it
would lead to the least error of estimate, overall.


In actual practice, the distribution of $E_k$ values will lie somewhere between
the two extremes described above. For this purpose, a fundamental statistical
parameter called the standard error of prediction is available. This device
was designed to be used when the estimation errors are expected to be approximately
normally and independently distributed. The formula for this parameter
is as follows:

$$\sigma(Y_k) = \sigma_E \sqrt{1 + \frac{1}{N} + \sum_{i=1}^{m} \sum_{j=1}^{m} c_{ij}\, x_i x_j} \tag{2}$$

where:  $\sigma(Y_k)$ is the standard error of prediction for an individual $Y_k$
        derived from selected $X_i$. This parameter defines the limits
        within which one can expect the true $Y_k$ to fall two-thirds
        of the time.

        $\sigma_E$ is the standard error of estimate, defined as the root mean
        square error adjusted for sampling bias and the number of
        predictors used, i.e., $\sqrt{\sum E^2 / (N - m - 1)}$, where $E$ = actual Y
        minus Y computed from the regression formula.

        $N$ is the number of data points on which the estimation weights
        are based.

        $m$ is the number of predictor variables.

        $i, j$ are subscripts used to define cross-multiplication among
        predictors.

        $c_{ii}$, $c_{ij}$ are multipliers used to weight the cross-products of predictor
        deviations from the mean. These multipliers are obtained from
        the inverse of the augmented correlation matrix by the following
        formulas:

$$c_{ii} = \frac{a_{yy}\, a_{ii} - a_{yi}^{2}}{(N-1)\, \sigma_i^{2}\, a_{yy}} \tag{3}$$

and

$$c_{ij} = \frac{a_{yy}\, a_{ij} - a_{yi}\, a_{yj}}{(N-1)\, \sigma_i \sigma_j\, a_{yy}} \tag{4}$$

where:  $a_{yy}$ is the value of the inverse element for the dimension to be
        predicted.

        $a_{ii}$ is the value of the diagonal inverse element for the ith variable.

        $a_{yi}$ is the value of the inverse element at the row-column juncture
        of y and i.

        $a_{yj}$ is the value of the inverse element at the row-column juncture
        of y and j.

        $N$ is the number of data points.

        $\sigma_i$ and $\sigma_j$ are unbiased estimates of the population standard deviation for
        the predictor variables arrayed in i rows and j columns.

        $x_i$, $x_j$ are the deviations of the predictor ($X_i$, $X_j$) values from their
        respective means.
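
The standard error of prediction and its multipliers can be checked numerically. The sketch below is a hedged illustration on synthetic data, with hypothetical function names: it computes the multipliers from the inverse of the augmented correlation matrix (predictors plus the cost), in the form of the two multiplier formulas above, and compares the result with the familiar textbook form based on the inverse of the deviation cross-product matrix. The two routes should agree if the formulas are read as differences in the numerators.

```python
import numpy as np

def fit_sigma_e(X, y):
    """Standard error of estimate from a least-squares fit with a constant."""
    n, m = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return np.sqrt(resid @ resid / (n - m - 1))

def c_multipliers(X, y):
    """Multipliers c_ij from the inverse of the augmented correlation
    matrix, whose last row and column belong to the cost y."""
    n, m = X.shape
    a = np.linalg.inv(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    sd = X.std(axis=0, ddof=1)                 # unbiased standard deviations
    a_yy, a_y = a[m, m], a[m, :m]
    numerator = a_yy * a[:m, :m] - np.outer(a_y, a_y)
    return numerator / ((n - 1) * np.outer(sd, sd) * a_yy)

def prediction_se(X, y, x_new):
    """Standard error of prediction at the predictor values x_new."""
    n = X.shape[0]
    d = x_new - X.mean(axis=0)                 # deviations from the means
    return fit_sigma_e(X, y) * np.sqrt(1 + 1 / n + d @ c_multipliers(X, y) @ d)

# Synthetic data for illustration only (26 points, 4 predictors).
rng = np.random.default_rng(2)
X = rng.normal(size=(26, 4))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(scale=0.7, size=26)
x_new = X.mean(axis=0) + 1.0

se_eq2 = prediction_se(X, y, x_new)
# Cross-check against the form using the deviation cross-product matrix.
Xc = X - X.mean(axis=0)
d = x_new - X.mean(axis=0)
se_check = fit_sigma_e(X, y) * np.sqrt(1 + 1 / 26 + d @ np.linalg.inv(Xc.T @ Xc) @ d)
se_at_means = prediction_se(X, y, X.mean(axis=0))
print(se_eq2, se_check, se_at_means)
```

The prediction error grows as the predictor values move away from their sample means, which is why limits computed only at the means understate the uncertainty of estimates for unusual programs.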


In the particular case where all predictor variables are taken at their
respective arithmetic means, the above formula for the standard error of
prediction reduces to:

$$\sigma(Y_k) = \sigma_E \sqrt{1 + 1/N} \tag{5}$$

For example, when N = 26, $\sigma(Y_k) = 1.02\,\sigma_E$, i.e., 1.02 times the
standard error of estimate.

It is customary in confidence estimation to use approximately $\pm 2\sigma(Y_k)$ to
establish the 95 percent confidence limits for a predicted value. This
provides the extremes within which the true $Y_k$ value can be expected to
fall 95 percent of the time. For analyses based on small samples, the
confidence limits must be expanded to account for the lesser stability of
the predictions. For example, when the number of data points is 26 and the
number of predictor variables is 4, one must use $\pm 2.08\sigma(Y_k)$ (the value
of Student's t for 26 - 4 - 1 = 21 degrees of freedom) rather than
$\pm 2\sigma(Y_k)$ to establish the 95 percent confidence limits. The use of these
devices provides a safeguard against unwarranted acceptance of statistical
results derived from small samples. The results of using equation (5) for
determining  confidence  limits  are  included  in  the  tables  of  Appendix  VII, 
which  summarize  the  results  of  the  correlation  and  regression  analysis.  In 
subsequent  research,  it  is  anticipated  that  the  more  complete  calculation  of 
confidence  limits  using  equation  (2)  will  be  appropriate.  This  technique 
will  allow  the  calculation  of  confidence  limits  for  specific  values  of  the 
predictor  variables,  in  each  use  of  an  estimating  equation. 
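
The arithmetic of the simplified confidence limits can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and the estimate of 100 man months with a standard error of estimate of 20 is invented, not a result of the study. The multiplier 2.08 is the small-sample value quoted above for 26 data points and 4 predictors.

```python
import math

def se_prediction_at_means(sigma_e, n):
    """Equation (5): standard error of prediction at the predictor means."""
    return sigma_e * math.sqrt(1 + 1 / n)

def confidence_limits(y_hat, sigma_e, n, multiplier):
    """Return (lower, upper) limits as y_hat +/- multiplier * sigma(Y_k)."""
    half_width = multiplier * se_prediction_at_means(sigma_e, n)
    return y_hat - half_width, y_hat + half_width

ratio = se_prediction_at_means(1.0, 26)   # about 1.02, as in the text
# Hypothetical estimate: 100 man months, standard error of estimate 20.
lower, upper = confidence_limits(100.0, 20.0, 26, 2.08)
print(round(ratio, 2), round(lower, 1), round(upper, 1))
```

Limits of this width show why equations derived from so small a sample were not recommended for actual planning.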

Selection  of  Predictor  Variables.  As  noted  above,  a  primary  problem  facing 
the  investigators  was  to  reduce  the  number  of  predictor  variables  to  be 
submitted  to  the  regression  analysis.  Clearly,  the  93  predictor  variables 
had  to  be  reduced  to  less  than  27  (the  available  number  of  data  points) 
before  the  regression  technique  could  be  applied.  In  Appendix  II,  we  have 
listed  definitions  of  all  predictor  variables  and  their  coding.  These 
variables  are,  in  actuality,  the  questionnaire  items  described  in  Appendix  I. 
Appendix  V  is  a  validity  table  summarizing  these  same  variables  and  their 
individual  correlations  with  costs.  Below  we  describe  how  we  used  the 
principles  mentioned  earlier  to  select  or  reject  predictor  variables  for 
further  analysis.  These  principles  were  applied  in  several  overlapping 
phases:  examination  of  raw  data,  correlation  analysis,  regression  analysis 
and  factor  analysis.  The  results  of  the  selection  process  are  recorded  in 
Section  IV,  Summary  of  Results. 

1.  Examination  of  Raw  Data.  The  responses  to  the  questionnaire  were  tabulated 
in a data matrix (Appendix III) in which each column (variable) was carefully
examined. Ten of the original variables were immediately rejected
for  one  or  more  of  the  following  reasons: 

a.  Lack of variance or a predominance of constant values. For example,
if a yes or no question exhibited more than twenty identical responses,
the variable was rejected.

b.  Identity  with  other  variables.  In  cases  where  columns  displayed 
identical  or  near-identical  entries  to  other  columns,  a  rejection 
of  one  of  the  variables  was  made. 

c.  Poor  distribution  characteristics.  If  examination  of  the  data 
revealed  large  gaps  (discontinuities)  or  highly  skewed  (unbalanced) 
results,  the  variable  was  rejected. 


d.  Excessive  amount  of  missing  data.  In  cases  where  only  a  few  cells 
were  missing,  the  investigators  filled  these  in  as  accurately  as 
possible.  However,  if  many  entries  were  missing,  predictor  variables 
were rejected, but cost variables were retained for further analysis.

e.  Apparently  ambiguous  question.  If  the  majority  of  responses  appeared 
to  be  incorrect,  that  is,  not  responsive  to  the  intent  of  the  question, 
the  variable  was  rejected. 

f.  Dependence on other variables. For example, in a series of ratio
or percentage variables adding up to 100 percent, one variable could
be rejected as being dependent on the others.

g.  Lack  of  strong  intuitive  appeal.  This  criterion  generally  pervaded 
the  rejection  of  variables  throughout  the  research. 
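
Some of the mechanical rejection rules listed above (lack of variance, identity with other variables, and excessive missing data) lend themselves to a simple automated screen. The sketch below is a hypothetical illustration: the column names and data are invented, the limit of five missing entries is an assumption (the text says only "many"), and the threshold of twenty identical responses is applied here to any column although the text states it for yes or no questions. The judgmental criteria, such as ambiguity and intuitive appeal, are not modeled.

```python
from collections import Counter

def screen_variable(values, max_identical=20, max_missing=5):
    """Return the reasons for rejecting one column of the data matrix.
    `values` holds one response per data point; None marks a missing entry."""
    reasons = []
    present = [v for v in values if v is not None]
    if len(values) - len(present) > max_missing:
        reasons.append("excessive missing data")
    if present and Counter(present).most_common(1)[0][1] > max_identical:
        reasons.append("lack of variance")
    return reasons

def duplicate_columns(columns):
    """Return pairs of column names whose entries are identical."""
    names = list(columns)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if columns[a] == columns[b]]

# Hypothetical 27-point columns (not the study's data).
columns = {
    "uses_compiler": ["yes"] * 25 + ["no"] * 2,
    "n_subprograms": list(range(1, 28)),
    "n_routines": list(range(1, 28)),
    "pct_new_code": [None] * 10 + [50] * 17,
}
rejections = {name: screen_variable(vals) for name, vals in columns.items()}
duplicates = duplicate_columns(columns)
print(rejections, duplicates)
```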

2.  Examination  of  Correlations.  At  this  point,  there  were  83  "independent" 
and  15  dependent  variables  under  consideration.  The  first  computer  run 
consisted  of  the  computation  of  a  98  by  98  correlation  matrix,  which 
depicted  the  statistical  relationship  of  every  variable  with  every  other 
variable.  Each  predictor  variable  was  examined  first  for  its  correlation 
with  costs  as  a  preliminary  means  of  checking  its  validity.  Variables 
with  low  correlations  and  spuriously  signed  correlations  were  then 
considered  for  possible  rejection.  Because  a  considerable  number  of 
variables  had  to  be  rejected  before  regression  analysis  could  be  attempted, 
variables  with  low  validity  coefficients  were  not  accepted  unless  they  had 
strong intuitive appeal. In all cases where variables were selected or
rejected, they were checked for meaningfulness, unambiguousness, availability,
and general appeal. These criteria are, of course, all subject
to the investigators' judgment and intuition. Highly valid predictor
variables were examined for their intercorrelations. We realized that
highly intercorrelated predictor variables, even though valid,¹ would
tend to complicate the regression analyses should they be allowed to
compete with each other in accounting for variance; to avoid spurious
results such as negative regression coefficients for variables which are
positively related to costs, we decreased the number of highly intercorrelated
predictors by selecting the most appealing of the competing
variables. This technique was also designed to increase the true independence
or uniqueness of the predictor variables being considered for
inclusion in the equations.

¹ Given approximately the same validity level among predictors, an equation based
on more unique independent variables will be more trustworthy than one based on
highly correlated variables; this is because multicollinearity increases the
sensitivity of parameter estimates to such things as changes in the set of
independent variables used, the relative presence or absence of extreme observations,
and the direction of minimization. This thereby reduces one's confidence
in the usefulness of whatever structural estimates happen to emerge.
Reference (8) provides a more thorough technical discussion of this important
topic. Although no standard test of significance exists for evaluating the
extent to which multicollinearity affects an equation, Formula (2) (standard
error of prediction) is considered to be useful for evaluating competitive
equations, since it takes into consideration the nature of the inverse and,
consequently, the relative value of the determinants of the predictor intercorrelation
matrices from which the equations have been derived. We were
unable to utilize the standard error of prediction as an evaluation device in
the current analysis due to lack of time and a suitable computer program.
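
The reduction of highly intercorrelated predictors can be sketched as a greedy procedure. In the study the choice among competing variables rested on the investigators' judgment of appeal; the illustration below substitutes a mechanical stand-in, keeping the competitor with the higher validity (absolute correlation with cost). The threshold of 0.8, the function name, and the synthetic data are all assumptions.

```python
import numpy as np

def prune_intercorrelated(X, y, names, max_r=0.8):
    """Visit predictors from highest to lowest validity and keep one only
    if its |correlation| with every predictor already kept is <= max_r."""
    m = X.shape[1]
    corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    validity = np.abs(corr[:m, m])             # correlation with the cost
    kept = []
    for i in np.argsort(-validity):
        if all(abs(corr[i, j]) <= max_r for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

# Synthetic data: two near-duplicate size measures and one unrelated predictor.
rng = np.random.default_rng(3)
size = rng.normal(size=26)
X = np.column_stack([
    size,
    size + rng.normal(scale=0.05, size=26),    # near-duplicate of the first
    rng.normal(size=26),
])
y = 3.0 * size + X[:, 2] + rng.normal(scale=0.3, size=26)
selected = prune_intercorrelated(X, y, ["size_a", "size_b", "other"])
print(selected)
```

Only one of the two near-duplicate measures survives, which removes the competition for variance that can produce spuriously signed regression coefficients.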

3.  Examination  by  Regression  Analysis.  At  this  point,  the  number  of  variables 
selected for further analysis was over 50. This still greatly exceeded the
number  of  available  data  points.  Therefore,  the  number  of  variables  was 
further  reduced  and  divided  into  two  groups.  One  group  was  labeled  "most 
preferred"  and  consisted  of  15  predictor  variables;  the  other  was  labeled 
"satisfactory"  and  consisted  of  21  predictor  variables.  At  this  point, 
multivariate  analysis  was  introduced  to  further  reduce  the  number  of 
variables. 

A multiple regression analysis program (9) along with an IBM 7094 computer
were  the  primary  computational  tools  used  by  the  investigators,  although 
other  computer  programs  were  also  used  in  support  of  this  effort.  The 
linear  multiple  regression  program  we  used  can  perform  a  complete  analysis 
on  as  many  as  80  variables,  provided  enough  data  points  are  available  to 
justify  the  computations.  The  following  quantities  are  computed  and 
output:  sums  and  sums  of  squares,  means,  sample  size,  standard  deviations, 
the intercorrelation matrix, standardized and weighted regression coefficients,
the standard error of estimate, the coefficient of determination,
the  multiple  correlation  coefficient,  and  the  constant  in  the  regression 
equation. 

The program also selects subsets of independent variables that yield near-maximum
multiple correlations (i.e., near-minimum residuals). Once the
program  selects  a  subset  of,  say,  m  variables,  it  computes  and  outputs  the 
following  statistics:  values  of  a  gradient  selection  index  for  each 
variable,  standardized  and  weighted  regression  coefficients,  the  regression 
constant,  the  coefficient  of  determination,  the  multiple  correlation 
coefficient,  the  shrunken  multiple  correlation  coefficient,  the  standard 
error  of  estimate,  the  increase  in  the  multiple  correlation  from  the 
previous  subset,  the  change  in  the  shrunken  multiple,  the  decrease  in 
the  multiple  correlation  from  the  complete  set  of  independent  variables 
and  the  corresponding  F  ratio.  The  program  will  continue  selecting 
larger  and  larger  subsets  of  predictor  variables  until  a  predesignated 
stop  criterion  is  satisfied. 
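
The subset-selection behavior described above can be approximated by a plain forward selection. The sketch below is not the program cited as Reference (9); its gradient selection index and stop criterion are not reproduced. As assumptions, it adds at each step the predictor that most raises the shrunken (adjusted) coefficient of determination and stops when the gain falls below a hypothetical `min_gain`. The data are synthetic.

```python
import numpy as np

def multiple_r2(X, y, cols):
    """Coefficient of determination for the regression of y on X[:, cols]."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, min_gain=0.01):
    """Add, one at a time, the predictor that most raises the shrunken
    R^2; stop when the best available gain is smaller than min_gain."""
    n, m = X.shape
    chosen, best_adj = [], 0.0
    while len(chosen) < min(m, n - 2):
        trials = []
        for c in range(m):
            if c not in chosen:
                k = len(chosen) + 1
                r2 = multiple_r2(X, y, chosen + [c])
                # Shrinkage for sample size n and k predictors.
                trials.append((1 - (1 - r2) * (n - 1) / (n - k - 1), c))
        adj, c = max(trials)
        if adj - best_adj < min_gain:
            break
        chosen.append(c)
        best_adj = adj
    return chosen, best_adj

# Synthetic data: only predictors 0 and 3 actually drive the cost.
rng = np.random.default_rng(4)
X = rng.normal(size=(26, 8))
y = 4.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(scale=0.5, size=26)
chosen, adj_r2 = forward_select(X, y)
print(chosen, adj_r2)
```

Using a shrunken rather than a raw multiple correlation penalizes each added variable, so the selection tends to stop once the remaining candidates contribute only noise.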

In  the  first  regression  analysis,  we  planned  to  use  the  subsetting  feature 
and  the  computation  of  the  standard  error  of  estimate  to  assist  in  further 
rejecting  variables.  Specifically,  we  expected  the  minimum  standard  error 
of  estimate  to  occur  after  the  selection  of  about  four  to  eight  variables, 
whereupon the remaining variables would be rejected. When first used, this
technique  did  not  give  a  clear  indication  of  which  variables  to  consider  for 
further  selection.  In  fact,  the  minimum  standard  error  occurred  in  most  cases 
after  selecting  all  the  variables  submitted  to  it.  We  found  that  data  point  5, 
an  unusually  large  deviate  in  the  multiple  solution  space,  was  the  anomaly  in 
the  analytical  process.  When  this  point  was  removed  on  the  fourth  regression 
analysis,  the  computed  solutions  proceeded  in  a  straightforward  manner  and 
subsets  of  variables  with  minimum  standard  error  were  readily  apparent.  Hence, 
the majority of results that follow are based on a sample of 26 data points
rather  than  the  27  for  which  data  were  collected. 

In the second regression analysis, the "most preferred" and the "satisfactory"
variable groups (see Tables I and II in Section IV) were reexamined on the
basis  of  all  criteria,  with  special  emphasis  placed  on  identifying  and  rejecting 
redundant  variables.  The  two  groups  were  then  consolidated  into  one  group  of 
17  "best"  variables  that  was  subjected  to  further  regression  analysis.  To 
continue  the  process  of  rejecting  variables,  we  used  standardized  partial 
regression  coefficients  as  a  means  for  evaluation.  Variables  with  coefficients 
less  than  .10  and  those  with  an  algebraic  sign  inconsistent  with  good  judgment 
were  generally  rejected.  The  number  of  predictor  variables  associated  with  each 
cost  variable  was  thus  reduced  to  less  than  ten.  These  variables  were  then 
submitted  to  a  third  and  fourth  regression  analysis,  the  results  of  which  are 
described in Section IV.

We  conducted  a  fifth  regression  analysis  to  completely  eliminate  the  potential 
bias introduced by data points 4, 5, and 6, the extremely large programs, in
terms  of  man  months  and  number  of  instructions.  This  final  analysis  also  used 
the  number  of  delivered  instructions  as  a  predictor  variable  in  place  of  the 
companion  variable,  number  of  instructions  originally  estimated.  When  all 
extreme  data  points  were  removed,  the  scatter  plot  relationships  between  costs 
and delivered instructions (see Figure 5, Section IV) were more meaningful and
trustworthy than those using the after-the-fact reports of estimated instructions.
Since both variables were collected simultaneously, this appeared to be a reasonable
choice of alternatives.

Because  the  number  of  delivered  instructions  played  such  a  dominant  role  in 
this  study,  a  companion  analysis  was  performed  to  derive  an  equation  for 
estimating  delivered  instructions  from  other  predictor  variables,  completely 
excluding  the  variable,  estimated  instructions.  The  results  of  this  analysis 
are described in Section IV.

The following is a summary of the five regression analyses:

Analysis   Variables       Comments
           Considered

First      Tables I & II   We intended to reject variables appearing in
                           results after minimum standard error of estimate
                           was achieved (N = 27).

Second     Table III       We selected variables for further analysis on
                           basis of satisfactory standardized regression
                           coefficients and meaningfulness (N = 27).

Third      Table IV        Specific predictor variables were grouped with
                           specific cost variables (N = 27).

Fourth     Table IV        We repeated the previous analysis with omission
                           of data point 5 and also conducted a special
                           analysis to derive an equation for estimating
                           delivered instructions from other predictors
                           (N = 26).

Fifth      Table V         A final analysis only on variable 84 (man months)
                           with omission of data points 4, 5, and 6 (N = 24).

4.  Examination  by  Factor  Analysis.  In  addition  to  the  techniques  described 
earlier,  we  also  initiated  the  use  of  factor  analysis  (10)  as  a  means 
for  studying  the  relationships  among  the  cost  predictor  variables.  This 
technique  allowed  the  predictor  intercorrelation  matrix  to  be  described 
by  a  smaller  number  of  independent  entities,  called  factors,  that  helped 
to  account  for  the  observed  intercorrelations  in  the  matrix.  Using  an 
IBM 7094 computer program (11), we obtained a table of factor loadings
showing  the  relationship  between  each  variable  and  each  factor.  Viewed 
geometrically,  these  loadings  represent  the  projections  of  the  variables 
(as  vectors)  on  referent  axes  in  an  orthogonal  multidimensional  coordinate 
system.  Since  the  referent  axes  are  rather  arbitrarily  defined  in  the 
basic  calculation  process,  they  may  be  rotated  to  any  position  that  will 
enhance  the  description  of  the  original  data.  Another  IBM  7094  computer 
program  (12),  employing  a  varimax  method  of  factor  rotation,  was  used  in 
this  study  to  achieve  factorial  description  of  the  83  variables  in  the 
predictor  pool.  The  table  shown  in  Appendix  VI  illustrates  the  results 
of  using  this  approach. 
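
The rotation step described above can be illustrated for the simplest case of two factors. The sketch below is not the IBM 7094 program cited in the text. It assumes made-up loadings for five variables and finds the rotation angle by a plain grid search on the varimax criterion (the variance of the squared loadings, summed over factors). All names and values are hypothetical.

```python
import math

# Hypothetical unrotated loadings of five variables on two factors
# (illustrative values only; not taken from Appendix VI).
LOADINGS = [
    (0.70, 0.50),
    (0.65, 0.55),
    (0.60, 0.45),
    (0.55, -0.60),
    (0.50, -0.65),
]


def rotate(loadings, angle):
    """Rotate the two referent axes by `angle` radians (orthogonal rotation)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(a * c + b * s, -a * s + b * c) for a, b in loadings]


def varimax_criterion(loadings):
    """Sum over the two factors of the variance of the squared loadings."""
    n = len(loadings)
    total = 0.0
    for j in range(2):
        squares = [row[j] ** 2 for row in loadings]
        mean = sum(squares) / n
        total += sum((q - mean) ** 2 for q in squares) / n
    return total


def varimax_two_factor(loadings, steps=9000):
    """Grid search over 0..90 degrees for the angle that maximizes the criterion.

    The criterion repeats every 90 degrees, so this range covers all cases.
    """
    angles = (k * (math.pi / 2) / steps for k in range(steps + 1))
    best = max(angles, key=lambda ang: varimax_criterion(rotate(loadings, ang)))
    return best, rotate(loadings, best)


angle, rotated = varimax_two_factor(LOADINGS)
for before, after in zip(LOADINGS, rotated):
    print(before, "->", tuple(round(v, 2) for v in after))
```

The rotation leaves each variable's communality (the sum of its squared loadings) unchanged; only the description in terms of the axes changes, which is the sense in which the axes are "rather arbitrarily defined."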

Factor  analysis,  like  regression  analysis,  requires,  among  other  things, 
a  favorable  data  point-to-variable  ratio  for  its  successful  application. 
Since the results shown in Appendix VI were based on only 26 data points
drawn exclusively from one organization's experience, and the questionnaire
used to obtain these points is in its first experimental phase,
they are not to be interpreted as definitive and exhaustive of the
computer programming domain. Although factor analysis is
aimed more at description than at prediction, it was felt that this approach
could  provide  a  valuable  adjunct  to  regression  analysis  in  the  search  for 
unique  and  valid  variables  for  predicting  computer  programming  costs. 
Accordingly,  in  this  study,  the  factor  composition  of  variables  was  taken 
into  consideration,  along  with  the  other  criteria,  when  variables  for 
regression  analysis  were  selected. 

Evaluation  of  Approach 

As  an  initial  exercise  in  analysis  of  programming  costs,  this  study  has 
outlined  problem  areas  and  suggested  ways  to  continue  both  the  data  collection 
and  statistical  analysis  more  effectively.  Most  importantly,  a  revision  of 
the  questionnaire  is  indicated  to  improve  the  relevancy  and  clarity  of  the 
data to be collected. The analytical techniques just described, although
powerful, appropriate tools for the examination of a highly complex
multivariate problem, require a relatively large data sample to produce reliable
and valid results. Since we did not have a large sample size, the results in
the next section should be considered as examples of the analytical techniques
rather than recommended prediction devices. An improved questionnaire design
that  is  pointed  at  minimizing  the  effort  required  to  complete  it  probably  will 
help  us  collect  data  from  a  larger  and  more  representative  audience. 

Improvements  of  the  questionnaire  and  data  collection  should  focus  on  the 
following: 

a.  Improved  definitions  of  terms.  For  example,  terms  such  as  data  point, 
programming  tools,  concurrence,  as  well  as  many  others  are  in  need  of 
more  concise  and  explicit  definition.  This  is  especially  necessary  to 
collect  meaningful,  comparable  data  from  organizations  outside  of  SDC. 

b.  Design  of  dichotomous  questions  for  ease  of  aggregation. 

c.  Extension  of  the  scope  of  the  program  development  effort  being  examined 
to  include  system  analysis,  as  well  as  installation  and  maintenance 
activities.  Although  difficulties  may  be  encountered  in  analyzing  a 
nonhomogeneous  population,  this  larger  view  is  much  more  realistic  and 
logical  in  attempting  to  account  for  all  the  factors  that  affect  the 
cost  of  programming. 


In  the  present  analysis,  the  dichotomous  variables  fared  rather  poorly  in 
predictor  variable  selection.  It  is  quite  probable  that  the  small  variance 
of  such  variables  acted  as  a  deterrent  against  their  selection  when  they 
were  matched  against  quantitative  variables  of  much  larger  variance. 
Appropriate  aggregation  would  allow  higher  variance  with  the  resultant 
possibility  that  they,  as  a  group,  might  better  complement  the  quantitative 
variables  and  contribute  to  additional  prediction  accuracy. 

d.  Elimination of questions that produce response variables having a
marginal or spurious contribution to costs. It would be highly desirable
to reduce the size of the questionnaire by eliminating such items.
However, this can now be done only with very low confidence with the sample
data available.

e.  Addition  of  questions  to  focus  more  on  the  actual  data  processing  to  be 
performed,  the  organization  for  program  development  and  the  relative 
value  of  the  resulting  program  and  its  documentation. 

f.  More  detailed  and  complete  validation  of  the  cost  data  to  insure  some 
degree  of  accuracy. 
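
The remark about the small variance of dichotomous variables can be made concrete with a little arithmetic. The sketch below uses invented yes/no answers (coded 1/0) from eight hypothetical programs. It shows only that a single 0/1 item can never have a variance above 0.25, while a simple count over several such items can vary much more; it does not reproduce the variable selection procedure used in this study.

```python
from statistics import pvariance

# Hypothetical yes/no (1/0) answers from eight programs to three questions.
q1 = [1, 0, 1, 1, 0, 0, 1, 0]
q2 = [1, 0, 0, 1, 0, 1, 1, 0]
q3 = [0, 0, 1, 1, 0, 1, 1, 0]

# A 0/1 variable with proportion p of ones has variance p * (1 - p) <= 0.25.
single_variances = [pvariance(q) for q in (q1, q2, q3)]

# Aggregate the three items into one index: the number of "yes" answers (0..3).
index = [a + b + c for a, b, c in zip(q1, q2, q3)]
index_variance = pvariance(index)

print("variance of each item:   ", single_variances)
print("variance of the aggregate:", index_variance)
```

How much the aggregate gains depends on how the items move together; items that tend to be answered "yes" for the same programs, as in this invented sample, add the most variance.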

The  success  of  multivariate  analysis  for  cost  prediction  depends  to  a  great 
degree  on  the  clear  and  meaningful  definition  of  variables  and  the  ability  to 
collect  sufficient  amounts  of  reliable  data  associated  with  these  variables. 
Because  the  ultimate  significance  of  specific  variables,  i.e.,  presumed  cost 
factors,  is  unknown  and  very  little  data  collection  has  been  accomplished,  the 
entire  data  collection  and  analysis  process  must  be  iterative.  One  objective 
of  an  analysis  of  past  program  development  efforts  is  to  establish  a  data 
collection  and  reporting  plan  for  new  development  efforts.  Descriptive  terms 
must  be  challenged  and  often  redefined  and  new  terms  and  definitions  created 
as  work  progresses.  Research  data  collection  and  processing  procedures  must 
also  be  challenged,  evaluated,  and  perhaps  modified.  As  more  and  more 
relevant  data  become  available,  the  output  of  a  research  program  of  this  type 
can  be  expected  to  become  more  and  more  accurate  and  valuable.  The  maximum 
value  of  this  kind  of  analysis  can  be  obtained  by  submitting  results  to 
managers  for  actual  use.  Finally,  the  ongoing  nature  of  the  data  collection 
program  suggested  above  will  allow  the  timely  assessment  of  important  new 
factors  such  as  advanced  programming  techniques,  equipment  and  procedures 
that  are  being  introduced  into  computer  program  development. 


IV.  SUMMARY  OF  RESULTS 


Introduction 


This  section  details  the  results  of  the  predictor  variable  selection  process 
described  earlier,  presents  some  selected  regression  equations  derived  from 
the  statistical  analysis,  and  interprets  the  results  in  terms  of  their 
validity,  usefulness,  and  implications  for  further  work.  Associated  with 
each  regression  equation  are  error  indices  (residuals)  that  reveal  the 
specific  portion  of  the  cost  variance  unaccounted  for  by  the  equation. 

When  plotted  graphically,  these  residuals  readily  describe  how  the 
estimated  or  computed  value  of  the  cost  compares  with  the  actual  value. 

For  purposes  of  illustration,  this  section  presents  the  results  of  the 
regression  analyses  for  three  cost  variables:  (a)  man  months  for  program 
design,  code  and  test;  (b)  computer  hours;  and  (c)  number  of  delivered 
instructions.  Data  plots  for  these  cost  variables  are  also  provided. 
A detailed summary of the correlation and regression analyses for all cost
variables  is  presented,  in  tabular  form,  in  Appendix  VII.  These  tables 
indicate  the  variables  considered  in  the  final  analysis  and  present  such 
pertinent  statistics  as  the  means,  standard  deviations,  validity  coefficients, 
intercorrelations,  standardized  regression  coefficients  and  confidence  limits. 

Discussion 


The  primary  objective  of  the  analysis  described  in  the  previous  section  was 
the development of reliable cost-estimating equations. Assuming an adequately
large  and  representative  sample,  these  equations  will  predict  cost  such  that 
the  probable  errors  of  prediction  will  be  minimized.  Not  only  is  the  dependent 
variable  (cost)  of  great  interest  to  the  user  of  such  equations,  but  the 
regression  coefficients  themselves  imply  relative  significance  concerning 
the  independent  (predictor)  variables  comprising  the  estimating  equation. 
However,  the  reader  should  not  assume  that  control  of  statistically  derived 
predictor  variables  will  necessarily  control  costs.  The  significance  is 
primarily  statistical  and  not  necessarily  causal.  The  degree  of  causality  is 
related  to  such  things  as  the  meaningfulness  of  the  selected  variables  and 
the relative presence or absence of program quality and performance
considerations in cost estimation. For example, if in the equation (see Figure 1)
for the cost variable, man months, we reduce the numerical value for the
predictor variable, number of external documents, we then reduce the estimated
cost; it is also possible, however, that the quality of the program may be
reduced drastically.

The equations in this document are primarily illustrative of the research
methodology  and  are  not  recommended  for  use  in  actual  program  development 
efforts.  On  the  other  hand,  we  encourage  the  use  of  these  equations  on  an 
experimental  basis,  e.g.,  to  supplement  and  compare  with  other  estimation 
techniques.  Further,  reports  of  such  usage  will  be  extremely  valuable  in 
our  continuing  research. 

The  relatively  poor  distribution  of  data  in  the  cost  domain  requires  some 
discussion.  The  27  data  points  collected  in  this  study  consisted  of  3 
extremely  large  programs,  3  moderately  large  programs,  and  21  relatively 
small  programs  in  terms  of  number  of  instructions.  Mathematically,  this 
involved  the  fitting  of  a  regression  surface  across  large  areas  of  solution 
space  where  no  data  points  were  observed.  As  a  result,  the  equation  of  the 
cost  surface  favored  the  larger,  more  expensive  programs  represented  by  a 
small  percentage  of  data  points.  In  fact,  the  three  largest  points  affected 
the  investigation  so  adversely  that  they  were  all  purged  in  the  final  regression 
analysis.  During  the  analysis  we  began  the  purging  by  dropping  data  point  5, 
the  single  largest  data  point,  so  that  the  bulk  of  the  results  reflect  the 
analysis of 26 data points. A final regression analysis for cost variable 84
(man months) only was based on 24 data points (the three largest data points,
4, 5 and 6, removed).
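
The pull that one very large program exerts on a fitted line can be seen in a small numerical sketch. The data below are invented (seven small programs plus one extreme point) and are not the study's data points; the fit is an ordinary least-squares line of man months on thousands of instructions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical small programs: instructions (thousands) and man months.
small_x = [5, 8, 12, 15, 20, 25, 30]
small_y = [20, 45, 30, 70, 60, 110, 90]

# One hypothetical extremely large program.
large_x, large_y = 400, 3000

slope_small, intercept_small = fit_line(small_x, small_y)
slope_all, intercept_all = fit_line(small_x + [large_x], small_y + [large_y])

print(f"small programs only: slope = {slope_small:.2f} man months per 1000 instr.")
print(f"with the large one:  slope = {slope_all:.2f} man months per 1000 instr.")
```

With the extreme point included, the line is determined almost entirely by the gap between that point and the cluster of small programs, which is the situation described above.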

One goal in our selection of predictor variables was to use those that would
be available or easily estimated at the beginning of a programming effort.
In some cases, in the resulting equations, the estimation of the predictor
variables is easy; in others, a new problem arises. The best example of this
is the variable, number of computer program instructions. This variable has
significant correlation with cost; however, managers historically have had a
difficult time in estimating instructions. In the current sample, a high
correlation (.94) was observed between estimated and delivered instructions.
Since data on both variables were collected simultaneously, we suspect that
some contamination may have occurred to yield this high correlation. Our
approach to this situation was to use the variable called delivered
instructions as a key predictor (which, incidentally, increased the confidence
in the man months equation by 60 percent) and then to perform a separate
regression analysis to predict delivered instructions without using estimated
instructions as a variable. In general, this approach involves reducing the
larger problem of cost estimation to a series of smaller and, hopefully, less
complex problems of estimating the components of cost.

The  following  section  outlines  the  sequence  of  steps  we  used  in  selecting 
variables  for  regression  equations. 

Predictor  Selection 


In  the  section  on  methods,  we  pointed  out  the  need  to  reduce  the  number  of 
predictor  variables  before  a  meaningful  regression  analysis  could  be  attempted. 
A  principal  characteristic  of  regression  analysis  is  that,  as  the  number  of 
potential  predictors  increases  to  approach  the  number  of  data  points,  the 
solutions (i.e., regression coefficients) tend to be spurious. This fact
vitiated computerized variable selection capability, which is dependent,
in large part, on the computation of reliable standardized regression
coefficients. Before the variable selection capability of regression analysis
could  be  used  with  some  degree  of  legitimacy,  the  original  set  of  potential 
predictors  had  to  be  reduced  by  correlation  analysis,  intuitive  analysis, 
and  factor  analysis.  Part  of  the  total  correlation  matrix  (a  validity  table), 
i.e.,  the  relationship  of  each  predictor  variable  to  each  cost  variable,  is 
presented  in  Appendix  V.  The  remainder  of  the  matrix,  the  intercorrelations 
of  all  the  predictor  variables,  has  been  withheld  to  conserve  space. 
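
The limiting case of this problem is easy to demonstrate. In the sketch below, which uses random numbers rather than any data from this study, six hypothetical data points are fitted with an intercept and five predictors that are pure noise. Because there are as many coefficients as data points, the equation reproduces every "cost" exactly, even though the predictors carry no information.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


random.seed(1964)
N_POINTS = 6

# Each row: an intercept term followed by five predictors of pure noise.
design = [[1.0] + [random.random() for _ in range(N_POINTS - 1)]
          for _ in range(N_POINTS)]
cost = [random.random() for _ in range(N_POINTS)]  # noise as well

# With as many coefficients as data points, least squares reduces to solving
# the system exactly.
coefficients = solve(design, cost)
fitted = [sum(c * x for c, x in zip(coefficients, row)) for row in design]
max_residual = max(abs(f - y) for f, y in zip(fitted, cost))

print("largest residual:", max_residual)
```

A perfect fit of this kind says nothing about new programs, which is why the predictor pool had to be cut down before the regression coefficients could be given any weight.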


It  should  be  pointed  out  that  estimated  instructions  was  originally 
considered  a  predictor  variable  and  delivered  instructions  a  cost 
variable. 

The first predictor variables selected for regression analysis are shown in
Tables I and II. These were chosen on the basis of the following criteria:
high validity, uniqueness, meaningfulness, availability, and general appeal.
Except for a few cases, the variables omitted from Tables I and II were no
longer considered in the analysis. Table I contains a list of the 15 "most
preferred" variables, indicating their correlation with man months, the
variable  number,  and  a  comment  that  further  characterizes  them.  Table  II 
contains  the  selection  of  an  additional  21  "satisfactory"  variables  with 
similar  descriptive  information.  With  the  small  sample  size  available,  the 
probability  of  making  unwarranted  rejections  of  variables  by  the  methods 
used  is  high.  The  two  separate  regression  analyses  performed  on  the  variables 
in  Tables  I  and  II  were  followed  by  a  further  selection  of  variables,  results 
of  which  are  shown  in  Table  III.  In  general,  the  variables  listed  in  Table  III 
were  selected  because  they  ranked  high  in  validity  and  meaningfulness. 

While all the potential predictor variables in the first and second regression
analyses were regressed against fifteen cost variables (84 through 98), in the
third regression analysis we selected specific groups of predictor variables
to be regressed against eight major cost variables on the basis of previously
computed satisfactory standard regression coefficients and meaningfulness.
The results of these selections are shown in Table IV. The remaining cost
variables were either eliminated from further analysis or combined into new
dependent variables.¹ All the variables in Table IV were run again in a
fourth analysis using 26 points, data point 5 having been omitted. In the
fifth regression analysis, data points 4, 5, and 6 were eliminated and the
correlation analysis was repeated to select variables on the basis of new
correlation coefficients (see validity table, N = 24, Appendix V). Table V
lists the predictor variables considered in this regression analysis, which
was completed only for cost variable 84 (man months).


¹Variables 86 (average number of programmers), 92 (computer hours for program
design change), 93 (pages of documents for program design change) and 97
(number of other personnel) were considered to be poorly conceived and of
doubtful value, while variable 99 (total man months) became the sum of
variables 84, 85, 89, and 98; variable 100 (man months for program design
change) became the sum of variables 91 and 94. All variables are further
defined in Appendix II.

TABLE I. FIRST REGRESSION ANALYSIS
Most Preferred Variables

          Correlation*
Variable  with Man
No.       Months (84)   Short Variable Description        Comments

11        .89           Number of instructions in         Dominant predictor
                        original estimate (1000's)

18        .83           Number of input message types     Intercorrelated with 11,
                                                          estimated instructions

21        .80           Number of subprograms             Intercorrelated with 11,
                                                          estimated instructions

39        .78           Number of external document       Intercorrelated with 11,
                        types                             estimated instructions

17        .70           Number of data base classes       Intercorrelated with 18,
                        (log10)                           input messages, and 16,
                                                          words in data base

33        .56           Number of programming tools       Intercorrelated with 31,
                                                          time of peak program
                                                          changes

38        .45           Number of internal document       Intercorrelated with 11,
                        types                             estimated instructions

44        .41           Number of words in core           Intercorrelated with 64,
                        storage (1000's)                  terminations per month

26        .36           Percentage of decision-making     High appeal
                        instructions

76        .30           Number of agencies required       Intercorrelated with 77
                        for concurrence                   and 78, experience and
                                                          decision capability of
                                                          agencies

32        -.30          Language type used                Possibly spurious
                                                          algebraic sign

23        -.29          Percentage of clerical            Low appeal, meaningful
                        instructions                      sign

8         .22           Number of commands                High appeal, low validity

29        .20           Timing constraint                 High appeal, low validity

5         -.12          How well operational              Meaningful sign, low
                        requirements known                validity

*These coefficients, based on 26 data points, changed significantly when all
the extremely large data points were removed. See the Validity Table for
N=24 in Appendix V.

TABLE II. FIRST REGRESSION ANALYSIS
Satisfactory Variables

          Correlation*
Variable  with Man
No.       Months (84)     Short Variable Description        Comments

83        .92             Number of trips x average         Highly correlated with 11,
                          miles/trip                        estimated instructions

30        .78             Number of program design          Very difficult to estimate
                          changes

13        .69             Number words in tables and        Not available early in
                          constants                         development

10        .67             Complexity rating                 Needs more quantitative
                                                            definition

65        .46             Number of hires per month         Moderate validity

64        .43             Number of terminations per        Moderate validity,
                          month                             meaningful sign

42        -.43            Computer operation adequately     Meaningful negative sign
                          documented

28        .40             Program design constraints:       Moderate validity
                          insufficient memory

41        -.32            Was computer time adequate        Meaningful negative
                          for parameter test                sign

12        .23             Ratio: new instructions/          May be difficult to
                          delivered instructions            estimate

1         .17             Innovation in operational         Low validity, high
                          system                            appeal

72        -.17            Document for cost control         Low validity,
                                                            meaningful sign

81        .17             Program developed at site         Low validity, high
                          different than operational        appeal

60        .08             Ratio: operational design         Low validity, high
                          programmers/total programmers     appeal

80        .07             Computer operated by another      Low validity, high
                          agency                            appeal

56-58     .41, .59, .03   Index of experience for
                          Types I, II, and III

52-54     .40, .13,       Percent of Programmers by
          -.3[?]          Types I, II, and III

*These coefficients, based on 26 data points, changed significantly when all
the extremely large data points were removed. See the Validity Table for
N=24 in Appendix V.

TABLE III. SECOND REGRESSION ANALYSIS

          Correlation*
Variable  with Man
No.       Months (84)   Short Variable Description

11        .89           Number of instructions in original estimate (1000's)
21        .80           Number of subprograms
39        .78           Number of external document types
13        .69           Number words in tables and constants
10        .67           Complexity rating
16        .65           Number of words in data base (log10)
38        .45           Number of internal document types
64        .43           Number of terminations per month
44        .41           Number of words in core storage (1000's)
26        .36           Percentage of decision-making instructions
23        -.29          Percentage of clerical instructions
8         .22           Number of commands
46        .22           Number of displays
72        -.17          Document for cost control
69        .15           Plan in the event of unavailable computer
5         -.12          How well operational requirements known

Comments: Variable 16, number of words in data base, was brought into the
list because it was statistically more compatible with 11, number of
instructions in original estimate, and other prominent predictors than was
17, number of D/B classes. Variables 46, number of displays, and 69, plan
for unavailable computer, were also re-entered due to their relative
uniqueness and high appeal. However, both were later rejected for reasons
of low predictive contribution.

*These coefficients, based on 26 data points, changed significantly when all
the extremely large data points were removed. See the Validity Table for
N=24 in Appendix V.

Cost-Estimating Equations


Four of the resulting cost-estimating equations from the fourth and fifth
regression analyses are presented here for illustrative purposes, while
eight equations of interest are presented with additional statistical detail
in Appendix VII. The first equation of interest, based on a sample of 26
data points, estimates man months for program design, code, and test:

Y_{84} = 2.7X_{11} + 121X_{10} + 26X_{39} + 12X_{38} + 22X_{16} - 497

Standard error of estimate = 138 m/m

95% confidence limit at the mean = ±295 m/m

Variables

84  Man months for program design, code, and test
11  Number of instructions in original estimate (in thousands)
10  Complexity rating (scale 1-5)
39  Number of external document types
38  Number of internal document types
16  Number of words in data base (log10)

Figure  1,  a  plot  of  actual  cost  versus  costs  estimated  with  this  equation, 
shows  residuals  (estimating  errors)  as  deviations  from  a  45-degree  line. 
Table  1  of  Appendix  VII  describes  the  statistical  characteristics  of  the 
variables  used  in  this  equation. 

Figure 1. Actuals vs Computed for Cost Variable 84, Man Months for
Program Design, Code and Test (N = 26)

The second equation estimates computer hours and is also based on the same
26 data points:

Y_{88} = 21.53X_{11} + [coefficient illegible]X_{10} + 1972X_{16} - 3[?]68

Standard error of estimate = 905 hours

95% confidence limit at the mean = ±1911 hours
Variables 

88  Computer  hours 

11  Number  of  instructions  in  original  estimate  (in  thousands) 

10  Complexity  rating  (scale  1-5) 

16  Number of words in data base (log10)

Figure  2  is  a  comparison  of  actuals  versus  computed  values  for  this  equation. 
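
The report does not state how the 95% limits at the mean were computed. One common reading, assumed here and not confirmed by the text, is t × s × sqrt(1 + 1/N), where s is the standard error of estimate, N the number of data points, and t the two-sided 95% value of Student's t with N - k - 1 degrees of freedom for k predictors. The sketch below applies that assumed formula to the two equations above; the function name and the small t table are illustrative.

```python
import math

# Two-sided 95% critical values of Student's t, from standard tables,
# keyed by degrees of freedom.
T_975 = {20: 2.086, 22: 2.074}


def limit_at_mean(std_error, n_points, n_predictors):
    """Assumed form of the 95% limit for a prediction made at the predictor means."""
    df = n_points - n_predictors - 1
    return T_975[df] * std_error * math.sqrt(1 + 1 / n_points)


# Man months equation: s = 138 m/m, N = 26, five predictors.
man_months_limit = limit_at_mean(138, 26, 5)

# Computer hours equation: s = 905 hours, N = 26, three predictors.
computer_hours_limit = limit_at_mean(905, 26, 3)

print(f"man months:     +/-{man_months_limit:.0f} (reported 295)")
print(f"computer hours: +/-{computer_hours_limit:.0f} (reported 1911)")
```

Under this assumption both computed limits fall within about one percent of the reported figures, which suggests, but does not prove, that a formula of this kind was used.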

Figure 2. Actuals vs Computed for Cost Variable 88, Total Computer
Hours (N = 26)


It  is  apparent  that  variables  10,  11,  and  16  are  components  in  both  equations. 
In  fact,  an  analysis  of  cost  variable  intercorrelations  revealed  that  man 
months  and  computer  hours  had  a  correlation  of  .97;  thus,  it  seems,  one  can 
be  predicted  from  the  other.  Figure  3  provides  a  scatterplot  and  a  simple 
regression  equation  showing  the  relationship  between  these  variables: 

Figure 3. Total Computer Hours vs Man Months for Program Design,
Code and Test (N = 26)

NOTE:  Data  point  5  is  plotted  here  although  it  was  not  used  in  the 
derivation  of  the  equation  shown  above. 


The  above  relationship,  if  supported  in  continued  analysis,  implies  that  the 
problem  of  estimating  computer  time  is  reduced  to  the  problem  of  estimating 
man months.

Apparent in most of the early equations we derived was the dominance of
predictor variable 11, estimated number of instructions, while other variables
seemed to be playing relatively minor roles. We suspected that the remaining
two large data points (4 and 6) in the sample were heavily influencing this
condition due to their size and the accuracy with which they have been
estimated. Therefore, we performed an additional analysis on cost variable
84 (man months), omitted data points 4, 5, and 6, and substituted variable 90
(delivered instructions) for variable 11 (estimated instructions). The results,
shown below and detailed in Table V, do indeed demonstrate a decreased emphasis
on number of instructions, and an increased significance of other variables:

Y_{84} = 2.8X_{90} + 1.3X_{83} + 33X_{39} - 17X_{59} + 10X_{46} + X_{12} - 188

Standard error of estimate = 70 m/m

95% confidence limit at the mean = ±150 m/m
Variables 

84  Man  months  for  program  design,  code,  and  test 
90  Delivered  instructions  (in  thousands) 

83  Trip  mileage  (thousands) 

39  External  document  types 

59  Type IV¹ programmer experience

46  Number  of  displays 

12  Percent  new  instructions 

A  comparison  was  made  of  actual  man  months  versus  man  months  computed  from 
the preceding equation. Figure 4 shows a marked decrease in the residuals,
thus  providing  a  visual  illustration  of  the  increased  confidence  that 
characterizes  this  equation. 


¹Type IV, the System Programmer, contributes to the formulation, planning,
design,  and  development  of  large  computer  program  systems.  A  more  complete 
definition  of  programmer  types  is  included  on  page  20  of  the  questionnaire. 

Figure 4. Actuals vs Computed for Cost Variable 84, Man Months for
Program Design, Code and Test (N = 24)

NOTE: This was the final analysis, using delivered instructions in
place of estimated instructions and excluding all extremely
large programs.

TABLE V

SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 84
Man Months for Program Design, Code and Test
(Final analysis, using variable 90 and excluding all extremely large programs)

Variable Number                        90       83       39       59       46       12       30       24

Short Description                   Del'd     Trip     Ext.  Type IV   No. of    % New    Prog.   % Data
                                   Instr.  Mileage   Docts.   Progr. Displays   Instr.   Chngs.   Reduc.
                                 (1000's) (1000's)  (Types)   Exper.                              Instr.

Means                                39.7     60.6      4.7      3.7      3.1     82.2     23.1     30.0
Standard Deviations                  26.9     77.3      2.5      3.3      6.0     26.3     36.8     19.6
Validity Coefficients                 .58      .68      .37     -.10      .68      .20      .67     -.22

Intercorrelations
  Variable 90                        1.00      .17      .17      .20      .54     -.14      .08      .13
  Variable 83                         .17     1.00      .14     -.02      .26      .15      .58     -.33
  Variable 39                         .17      .14     1.00      .52      .07     -.17      .42     -.42
  Variable 59                         .20     -.02      .52     1.00     -.14     -.39     -.09     -.29
  Variable 46                         .54      .26      .07     -.14     1.00      .05      .40     -.03
  Variable 12                        -.14      .15     -.17     -.39      .05     1.00      .01      .15
  Variable 30                         .08      .58      .42     -.09      .40      .01     1.00     -.32
  Variable 24                         .13     -.33     -.42     -.29     -.03      .15     -.32     1.00

Standardized Regression
Coefficients (11 variables)*          .41   .3[?]      .34     -.35      .26      .17      .12     -.19

Standardized Regression
Coefficients (6 variables)            .35      .47      .38     -.27      .30      .12      not      not
                                                                                          selected selected

Mean of Cost Variable                       203
Standard Deviation of Cost Variable         212
Number of Data Points                       24
Multiple Correlation Coefficient            .96
Standard Error of Estimate                  70
Standard Error of Prediction at the Mean    71
95% Confidence Limits at the Mean**         ±150 Man Months

PREDICTION EQUATION:

Y_{84} = 2.8X_{90} + 1.3X_{83} + 33X_{39} - 17X_{59} + 10X_{46} + X_{12} - 188

*There were 11 variables in the original selection run. Variables 26 (% Decision Instr.),
32 (Language Type) and 42 (Cptr. Oper. Doct'd) were also not selected due to extremely
small standardized regression coefficients.

**These limits will expand as predictions deviate from the mean.
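
The standardized coefficients in Table V and the coefficients of the prediction equation are related by a standard identity: a raw coefficient equals the standardized coefficient multiplied by the standard deviation of the cost variable and divided by the standard deviation of the predictor. The sketch below applies that identity to the six-variable figures as read from Table V; the dictionary layout and function name are illustrative, and small differences are expected because the table entries are rounded to two digits.

```python
SD_COST = 212.0  # standard deviation of cost variable 84 (Table V)

# variable number: (standardized coefficient, standard deviation of predictor)
PREDICTORS = {
    90: (0.35, 26.9),   # delivered instructions (1000's)
    83: (0.47, 77.3),   # trip mileage (1000's)
    39: (0.38, 2.5),    # external document types
    59: (-0.27, 3.3),   # Type IV programmer experience
    46: (0.30, 6.0),    # number of displays
    12: (0.12, 26.3),   # percent new instructions
}

# Coefficients as they appear in the prediction equation.
PRINTED = {90: 2.8, 83: 1.3, 39: 33.0, 59: -17.0, 46: 10.0, 12: 1.0}


def raw_coefficient(beta, sd_predictor, sd_cost=SD_COST):
    """Convert a standardized regression coefficient to raw units."""
    return beta * sd_cost / sd_predictor


raw = {var: raw_coefficient(beta, sd) for var, (beta, sd) in PREDICTORS.items()}

for var in PREDICTORS:
    print(f"variable {var}: computed {raw[var]:7.2f}   printed {PRINTED[var]:6.1f}")
```

The computed values agree with the printed equation to within rounding, which is a useful check that the table and the equation describe the same fit.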

Instruction-Estimating Equation


Since the intermediate predictor variable, number of instructions, played
such a significant role in this analysis, it is especially worthy of
additional study. Even though the smaller sample (N = 24) analysis tended to
reduce  the  contribution  of  this  variable,  a  reliable  technique  is  still 
needed  to  ascertain  this  quantity.  This  need  is  emphasized  again  in 
Figure  5,  which  depicts  the  relationship  between  man  months  and  instructions, 
and  Figure  6,  which  shows  the  relationship  between  computer  hours  and 
instructions. 


Figure 5. Man Months for Program Design, Code and Test vs Number of
Delivered  Program  Instructions  (N  =  26) 

NOTE:  Data  point  5  is  plotted  here  although  it  was  not  used  in  the 
derivation  of  the  equation  shown  above. 


Figure  6.  Total  Computer  Hours  vs  Number  of  Delivered  Program 
Instructions  (N  =  26) 

NOTE:  Data  point  5  is  plotted  here  although  it  was  not  used  in  the 
derivation  of  the  equation  shown  above. 

Shown  below  are  the  results  of  a  special  analysis  conducted  to  derive  an 
equation  for  estimating  the  total  number  of  delivered  instructions  without 
using  the  reported  estimate  of  this  number  as  a  component  in  the  equation. 

$$X_{90} = 2.6X_{18} + 1.25X_{21} + 5.63X_{13} - 13.9$$

Standard error of estimate = 25.7 Inst. (Thousands)

95% confidence limit at the mean = ±54.2 Inst. (Thousands)


Variables

90   Number of delivered instructions (in thousands)
18   Number of input message types
21   Number of subprograms
13   Number of words in tables and constants (log10)



Figure 7 shows a plot of actual versus computed number of instructions resulting
from the application of the equation to 26 data points. Additional
detail is provided in Table 8 of Appendix VII.
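
As an illustration of how the instruction-estimating equation might be applied, the following sketch evaluates it for a single program. The pairing of the three coefficients with variables 18, 21 and 13 follows the order of the variable list and is an assumption, as is the reading of variable 13 as a base-10 logarithm. The function name and the input values are hypothetical.

```python
import math

def estimate_delivered_instructions(input_message_types, subprograms, table_words):
    """Estimated number of delivered instructions, in thousands."""
    # Variable 13 is taken to be log10 of the word count (an assumption).
    x13 = math.log10(table_words)
    return 2.6 * input_message_types + 1.25 * subprograms + 5.63 * x13 - 13.9

# Hypothetical program: 10 input message types, 20 subprograms,
# and 10,000 words in tables and constants.
estimate = estimate_delivered_instructions(10, 20, 10_000)
print(f"{estimate:.1f} thousand instructions")
```

Because the confidence limits on this equation are broad, such a figure is best treated as a rough starting point rather than a planning estimate.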
Figure 7. Actuals vs Computed for Cost Variable 90—Delivered
Instructions (in Thousands) (Without Using Estimated Instructions,
N = 26)

NOTE: Vertical deviations are residuals.
As might be expected, the preceding equation has rather broad confidence
limits. We believe this condition stemmed, in part, from the original formulation
of the present research. At that time, very little thought was given
to variables that directly affect number of instructions, so that the predictors
shown in that equation are not necessarily the most realistic indicators of this
parameter. Other more fundamental factors must be formulated to describe more
specifically the nature of the data processing task to be performed. At any
rate, the dominance of number of instructions in this analysis provides a
strong stimulus for a deeper investigation into the underlying factors
associated with program size.

Summary  and  Conclusions 


One  relationship  in  which  we  can  now  begin  to  have  increasing  confidence  is 
that  between  costs  and  the  number  of  instructions  in  the  completed  program. 

In  the  sample  of  26  data  points  (with  data  point  5  removed),  cost,  in  terms 
of  man  months  and  computer  hours,  was  primarily  related  to  program  size 
(instructions)  and  less  influenced  by  other  factors.  Using  reported  estimates 
of  instructions  alone  as  a  predictor  of  man  months  for  program  design,  code 
and  test,  we  obtained  95  percent  confidence  limits  of  383  man  months  at  the 
predicted  mean  of  the  cost  variable.  By  adding  rated  program  complexity, 
external  document  types,  internal  document  types  and  number  of  data  base 
words to the equation, the confidence limits were decreased, and the statistical
confidence was increased by 23 percent. This suggested that the use of
suitable predictor variables other than number of instructions would help to
increase cost-estimating precision.

A  substantial  reduction  in  the  95  percent  confidence  limits  for  estimating 
man  months  was  achieved  by  eliminating  all  the  extremely  large  programs  (data 
points  4,  5  and  6)  from  the  regression  and  using  variable  90  (delivered 
instructions)  rather  than  variable  11  (reported  estimated  instructions)  as 
a  key  predictor  variable.  This  resulted  in  the  selection  of  five  companion 
predictor  variables  that  provided  an  enhanced  intuitive  quality  to  the 
equation  and  increased  the  confidence  in  the  final  equation  considerably. 
Specifically,  the  variables  trip  mileage,  external  document  types,  Type  IV 
programmer  experience,  number  of  displays,  and  percent  new  instructions,  when 
combined with delivered instructions, reduced the confidence limits to 150
man months, a reduction of 60 percent from those originally calculated. This
is a strong indication that an appreciable increment in cost-estimating precision
can be expected from the use of multiple predictor variables.
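
As a check on the arithmetic, taking the 383 and 150 man-month limits quoted in this section:

```latex
\[
\frac{383 - 150}{383} = \frac{233}{383} \approx 0.61
\]
```

This agrees with the stated reduction of about 60 percent.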

However,  despite  the  dominance  of  number  of  instructions  in  our  present 
research,  it  is  only  an  intermediate  cost-estimating  parameter,  not  a  measure 
of  programming  quality  or  program  performance,  and  therefore,  is  not  useful 

as  a  cost-effectiveness  measure.  To  measure  cost  effectiveness,  information 
concerning  important  but  presently  unmeasurable  design  characteristics  such 
as  a  program's  data-processing  capability,  complexity,  reliability,  usability 
and  changeability  will  be  needed. 
It is not recommended that program development efforts be compared solely on
the basis of cost per instruction. Perhaps an analogy to a more everyday
example will make this important point clearer. A station wagon and a sports
car  that  cost  equal  amounts  may  have  an  equal  number  of  engine  cylinders,  but 
the  value  and  performance  of  these  two  vehicles  can  be  clearly  distinguished 
in  terms  of  fuel  consumption,  acceleration,  design  for  family  use  or  sports- 
car  use,  and  so  forth.  Comparison  of  these  cars  on  a  cost-per-cylinder  basis 
is  virtually  meaningless,  which  is  our  point  concerning  cost  per  instruction. 

Since computer hours and man months were closely related in both the 26-data-point
study (r = .97) and the 24-data-point study (r = .91), it is anticipated
that similar findings will prevail concerning these major cost variables.

Such  findings,  if  substantiated  in  further  studies,  would  provide  a  firm 
foundation  for  improving  our  cost-estimating  techniques. 

The  results  of  the  analysis  of  cost  factors  by  statistical  techniques 
illustrate  clearly  that  meaningful  relationships  among  both  the  factors 
and  the  costs  can  be  derived.  Such  relationships  can  be  determined  with 
much  more  accuracy  and  validity  by  extending  the  analysis  to  larger  samples 
of  data  and  by  probing  more  deeply  into  the  fundamental  nature  of  the  data- 
processing  task. 

This  study  has  been  a  first  attempt  to  quantify  the  cost-contributing  effects 
of  some  of  the  factors  believed  to  affect  programming  costs.  Work  must  be 
initiated  in  certain  other  areas  if  programming  managers  are  to  obtain  a 
better  understanding  of  the  problems  of  costing,  evaluating  and  comparing 
computer  programs.  The  next  section  outlines  some  directions  in  which  the 
present  research  may  be  extended. 


V.  RECOMMENDATIONS  FOR  FUTURE  WORK 

In  addition  to  recommendations  for  a  continuation  of  the  cost  analysis  along 
the  lines  described  in  this  report,  we  discuss  here  a  number  of  problem  areas 
appropriate  for  future  research. 

Systematic  iteration  of  the  activities  of  data  collection  and  analysis  is  a 
necessary  condition  for  achieving  useful  cost-estimating  relationships.  For 
example,  many  of  the  predictor  variables  rejected  early  in  the  study  still 
hold  great  appeal  and  require  further  study  to  determine  their  utility  in 
cost  regression  equations.  Some  of  the  rejected  variables  that  have  high 
logical  appeal  are  listed  in  Table  VI. 
TABLE VI

SOME REJECTED VARIABLES THAT REQUIRE FURTHER STUDY

Variable  Correlation*  Short Variable Description        Comments
Number    with Man
          Months (84)

40         .88          Total number of computer hours    High correlation, but considered a
                        per week                          feedback variable rather than a
                                                          true predictor

19         .80          Number of output message types    Very highly correlated with 16,
                                                          number of words in data base; 17,
                                                          number of data base classes; and
                                                          18, number of inputs

30         .78          Number of program design changes  Difficult to estimate; correlated
                                                          with 83, trip miles

17         .70          Number of data base classes       A possible alternate for 16, number
                        (log10)                           of words in the data base (log10)

33         .56          Number of programming tools       Needs better description of tools;
                                                          possibly a feedback variable

6          .34          Number of system design changes   Difficult to estimate

76         .30          Number of agencies required for   Seems to be tied to 77 and 78,
                        concurrence                       experience and decision capability
                                                          of agencies

32        -.30          Language type used                Possibly spurious algebraic sign

1          .17          Innovation in operational system  Needs better definition of
                                                          innovation

*These coefficients, based on 26 data points, changed significantly when
all the extremely large data points were removed. See the Validity Table
for N = 24 in Appendix V.
We consider the techniques of regression analysis and factor analysis to be
particularly  robust  and  suitable  tools  with  which  to  continue  the  research. 

As a result of the experience gained in this, the first iteration, we feel
that  we  have  a  sound  basis  for  improving  the  initial  design  of  the 
questionnaire  and  for  collecting  data  to  form  a  larger  and  more  representative 
sample  of  program  development.  Specifically,  in  the  immediate  continuation 
of  the  cost  analysis,  we  need  a  sample  size  of  at  least  one  hundred  data  points. 
This  iteration  has  shown  the  feasibility  of  the  basic  approach;  the  next  one, 
based  on  a  sufficiently  large  sample,  should  result  in  estimating  equations 
with  higher  reliability  and  validity. 

Additional  Techniques 

In  addition  to  the  above  recommendations  for  more  satisfactory  data  collection 
and  analysis,  the  continuation  of  our  work  might  benefit  from  the  application 
of  the  following  techniques,  which  time  did  not  permit  us  to  use. 

1.  Using  a  Modified  Step-Wise  Regression  Analysis  to  Select  Predictor 
Variables.  When  the  ratio  of  potential  predictor  variables  to  observed 
data  points  approaches  or  exceeds  one  (and  the  sample  is  relatively  small) 
there  is  considerable  risk  that,  as  the  population  of  variables  is 
reduced  to  enable  the  computation  of  a  meaningful  regression  function, 
some  useful  variables  may  be  overlooked.  One  positive,  although 
incomplete,  method  for  reducing  this  risk  is  to  select  predictor 
variables  by  analyzing  the  correlation  coefficients  of  all  variables 
with  the  successive  residuals  resulting  after  the  influence  of  the  best 
single  prior  variable  has  been  removed  statistically.  This  approach  is 
known  as  stepwise  regression  analysis  and  may  be  used  successfully  when  a 
dependable  and  dominant  predictor  variable  is  available  as  a  core  around 
which  to  build  the  analysis  (the  variable  called  number  of  instructions 
appears  to  be  this  kind  of  a  variable). 

Computer  programs  for  conducting  stepwise  regression  usually  choose  the 
highest partial validity coefficient at each successive step in the selection
process. However, in a modified version of this approach, investigators
can  examine  the  results  before  each  selection  is  made.  In  this  way,  the 
investigators  may  override  the  automatic  selection  when  necessary  and 
choose  a  selection  sequence  that  best  meets  operational  criteria.  At  the 
same  time,  they  can  also  observe  and  tag  promising  predictor  alternates 
for analysis by conventional regression procedures.

2. Questionnaire Item Analysis and Aggregation. One alternative available
to minimize information loss, when there are many more predictor variables
than data points in the sample, is the systematic aggregation of variables
into homogeneous groups. This device is especially suitable when many of
the variables are, in fact, dichotomous questionnaire items, i.e., items
that can be answered YES or NO. If such items can be meaningfully scored
1 or 0, they can be grouped into submeasures that, in turn, could be
handled as variables. Aggregation decreases the initial population of
variables and thus allows a more favorable data point-to-variable ratio,
while preserving more of the information in the questionnaire.

All  dichotomous  questionnaire  items  need  not  be  aggregated — some  may  be 
ignored  as  a  result  of  poor  observed  validity,  i.e.,  low  correlation  with 
particular  cost  variables.  Such  items  may  be  subject  to  exclusion  from 
subsequent  versions  of  the  questionnaire  on  empirical  grounds;  however, 
contrary  to  the  emphasis  placed  on  eliminating  redundant  variables  in 
regression analysis, items that are valid and highly intercorrelated may
be kept to enhance the internal consistency and reliability of the questionnaire.
In fact, the intercorrelations among items can be used to
identify  clusters  that,  in  turn,  help  define  the  number  and  nature  of  the 
submeasures.  Therefore,  the  use  of  item  analysis  and  aggregation  in 
follow-on  research  may  lead  to  new  and  valid  predictor  variables. 

3.  Program  Cluster  Analysis  by  Using  Inverted  Factor  Analytic  Techniques. 
There  are  two  major  types  of  statistical  factor  analysis.  One  attempts 
to  describe  a  complex  of  descriptive  variables  in  terms  of  a  reduced  set 
of  underlying  factors.  This  is  the  conventional  factor  analysis  and  the 
one  that  was  used  to  some  extent  in  this  research.  There  is  another 
method of factor analysis, called inverted factor analysis or Q-Technique
(13),  which  is  concerned  with  the  manner  in  which,  not  variables,  but 
data  points  can  be  clustered  and  described  more  parsimoniously.  The  aim 
of  such  analysis  would  be  to  isolate  and  classify  the  basic  types  of 
computer  program  development  efforts.  Although  inverted  factor  analysis 
was  not  employed  in  the  current  research,  it  appears  to  offer  additional 
potential  for  determining  whether  programming  systems  can  be  grouped  into 
homogeneous  families  and,  therefore,  it  could  become  a  valuable  tool  for 
investigating  program  system  taxonomy. 
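
The selection procedure described under item 1 above, including the investigator's option to override the automatic choice, might be sketched as follows. This is a simplified variant and not the procedure of any particular stepwise regression program: it correlates the raw predictors with the successive residuals instead of computing true partial validity coefficients. All function names, the `review` hook, and the data are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def correlation(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def remove_influence(x, residual):
    """Residuals left after a simple least-squares fit of residual on x."""
    mx, mr = mean(x), mean(residual)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return [r - mr for r in residual]
    slope = sum((a - mx) * (r - mr) for a, r in zip(x, residual)) / sxx
    return [r - mr - slope * (a - mx) for a, r in zip(x, residual)]

def select_predictors(predictors, cost, steps, review=None):
    """Pick up to `steps` predictors by correlation with successive residuals.

    `review(ranked)` is the optional investigator override: it receives the
    remaining candidates ranked by |r| and returns the name to select.
    """
    residual = list(cost)
    chosen = []
    for _ in range(steps):
        ranked = sorted(
            ((name, correlation(values, residual))
             for name, values in predictors.items() if name not in chosen),
            key=lambda item: abs(item[1]), reverse=True)
        if not ranked:
            break
        name = review(ranked) if review else ranked[0][0]
        chosen.append(name)
        residual = remove_influence(predictors[name], residual)
    return chosen

# Made-up data: cost depends mostly on "size" and a little on "trips".
data = {
    "size":  [1, 2, 3, 4, 5, 6],
    "trips": [0, 1, 0, 1, 0, 1],
    "noise": [3, 1, 2, 3, 1, 2],
}
cost = [10 * s + 4 * t for s, t in zip(data["size"], data["trips"])]
print(select_predictors(data, cost, steps=2))
```

Passing a `review` function stands in for the investigator examining the ranked candidates before each selection; without it, the sketch behaves like the automatic choice of the highest coefficient.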

Related  Research  Areas 


Many  other  program  development  areas  require  research.  In  the  following,  we 
review  briefly  several  of  these.  We  feel  that  research  here  will  be  of 
inestimable  value  to  programming  managers  and  purchasers  of  programming 
products.  In  general,  all  of  the  suggestions  are  pointed  toward  providing 
a  cost/value  framework  for  managerial  decision-making  with  respect  to 
computer  program  development. 

1.  Development  of  Techniques  for  Estimating  Program  Size.  Since,  in  this 
analysis,  program  size  as  measured  by  number  of  instructions  had  such  a 
strong  relationship  to  costs,  a  reliable  technique  to  estimate  size  is 
sorely needed. One estimator, described in Figure 7, was developed by
using regression analysis. However, this formula still has rather broad
confidence limits. A related estimator was partially investigated and
is described in an SDC document (Reference 14). In that document, a
well-defined relationship between program design requirements and number
of program instructions was hypothesized. More specifically, the research
hypothesized a relationship between the number of operational decisions
contained in the program requirements and the number of decision class
instructions, and then, in turn, a relationship between the number of
decision class instructions and the total number of instructions. The
former relationship has never been investigated; only the latter relationship
was examined. However, the hypothesis of the relationship between
decision  class  instructions  and  total  instructions  was  tentatively 
supported  in  a  frequency  analysis  of  machine  instructions. 

2.  Development  of  Techniques  for  Estimating  the  Cost  of  Programming  Changes. 
Research  to  provide  estimates  for  the  cost  of  changes  would  be  highly 
dependent on the results of work in improving cost-estimating techniques.
Therefore,  as  a  sequel  to  this  study,  an  extension  could  be  conducted  to 
search  for  prediction  methods  that  provide  cost  estimates  in  replanning, 
e.g.,  when  changes  in  requirements  are  proposed.  Since,  in  such  cases, 
some  program  development  work  has  already  been  done  and  the  total  job  is 
more  clearly  defined,  the  predictions  would  have  to  be  more  accurate  than 
those  acceptable  for  an  initial  estimate.  More  details  would  be  required 
in the statement of factors that influence the cost of changes. Additionally,
better techniques would be needed to account for the requirement
imposed  by  the  need  to  modify  work  already  completed. 

3. Development of a Taxonomy of Computer-Based Information-Processing
Systems. A basic need for managers, users, and researchers is a more
systematic classification of both completed and projected work in
information processing. With the rapid development of new tools, techniques,
and applications in information processing, even the most advanced
students in the field struggle to keep abreast of the technology. Part of
this problem is the lack of a structure into which new developments can be
placed to allow comparison with past efforts.

To alleviate this problem, it would be necessary to develop a comprehensive
taxonomy or a series of taxonomies. These classification schemes would
provide generalized distributions (devoid of acronyms) along several
dimensions, such as functions performed, design characteristics, development
procedures, cost, elapsed time, and staffing. In addition to the
intrinsic worth of such taxonomies for relating various information-processing
developments, they could also provide a basis for collection
of data concerning cost, performance and lead time for use in cost
effectiveness studies. Additionally, they could possibly be used to
develop a benchmark as an aid to improved qualitative comparison of the
nonhardware portions of information-processing systems.
4. Development of Descriptors of Program Performance and Quality. In this
task,  researchers  need  to  clarify,  define,  and  determine  measurements 
relating  to  the  quality  of  computer  programs  and  to  program  documentation. 
This  area  of  work  overlaps  the  cost  work  described  previously  as  well  as 
the  effort  toward  an  information-processing  taxonomy. 

A  deeper  investigation  of  quality  should  consider: 

a.  What  programs  are  supposed  to  do  and  how  they  are  intended  to  be  used 
as  reflected  in  requirements  and  design  specifications. 

b.  What  programs  actually  do  as  determined  by  test,  exercise  and 
operational  use. 

c.  Ways  in  which  desired  quality,  including  performance  characteristics, 
can  be  expressed  unambiguously  and  preferably  quantitatively,  and  how 
the  products,  both  documents  and  programs,  can  be  inspected  during 
each  programming  activity  to  insure  that  quality  standards  are  met. 

At  present,  the  only  measurable  characteristics  that  are  generally  used 
to  describe  programs  are  computer  operating  time  and  program  size  or 
storage  requirements.  Although  programs  are  classified  by  titles  such 
as "storage and retrieval," or, at a lower level in the hierarchy, "input
format conversion," there is no set of descriptors that permits easy
comparison  of  programs  for  planning  purposes  and,  more  important,  for 
cost  estimation.  In  addition,  there  is  the  need  to  assign  more  meaning 
to  expressions  such  as  usability,  modularity  and  maintainability  as  they 
apply  to  specific  program  design  characteristics  and  as  they  apply  to 
the  way  programs  are  used.  The  descriptors  of  performance  and  quality 
discussed  here  are  intended  to  alleviate  both  the  problem  of  unambiguous 
requirement  specification  and  the  quality  control  problem  of  testing 
programs  so  that  errors  can  be  efficiently  detected  and  corrected. 

5. Development of Cost Trade-offs and Cost/Value Relationships. The above
studies  of  cost,  quality  and  performance  are  all  pointed  toward  cost- 
effectiveness  analysis.  In  cost-effectiveness  research,  appropriately 
derived  cost-estimating  relationships  and  measures  of  quality  and 
performance  could  be  used  to  construct  techniques  that  permit  quantitative 
comparisons  of  proposed  new  products,  tools  and  procedures.  The  research 
should  seek  the  identification  of  preferred  ways  to  develop  and  design 
nonhardware  components  based  upon  sound  economic  principles.  For  example, 
in  computer  program  development,  various  trade-offs  could  be  considered 
with respect to program design and performance, personnel mix, organization,
scheduling, quality control practices, documentation design, and
computer  usage. 
GUIDE TO APPENDICES


I  QUESTIONNAIRE 

Primary  data  gathering  instrument.  The  cost  factors  of  Volume  I 
are  rephrased  in  the  form  of  questions  in  an  attempt  to  quantify 
these  variables. 

II  DEFINITION  OF  VARIABLES 

The  items  in  the  questionnaire  are  then  rephrased  as  the  predictor 
and  cost  variables  that  are  analyzed  in  this  investigation. 

III DATA MATRIX

The responses to the questionnaire are tabulated by variable and
data point. Twenty-seven data points are described.

IV  DATA  ACCURACY 

An assessment by the responders to the questionnaire of the accuracy
with which 44 key questions were answered.

V  VALIDITY  TABLES 

The correlations for all predictor variables with all cost variables
are tabulated for both analyses of N = 26 and N = 24.

VI  FACTOR  LOADINGS 

The  results  of  the  rotated  factor  loadings  are  tabulated  for  N  =  26. 

VII  SUMMARY  OF  REGRESSION  EQUATIONS 

All  the  cost-estimating  equations  derived  in  this  analysis  are 
summarized  and  statistical  details  are  tabulated  such  as  the 
means, standard deviations, correlations, weighted and standardized
regression coefficients, standard error of estimate, and
confidence  limits. 
APPENDIX I—COST ANALYSIS QUESTIONNAIRE


INSTRUCTIONS  FOR  COMPLETING  QUESTIONNAIRE 

This questionnaire is a means for collecting data on past programming
efforts. These data will help us to identify and verify key factors
affecting the cost of computer programs. We are seeking to increase
the reliability of techniques for estimating costs of program
development.

The  questionnaire  is  organized  into  seven  parts.  The  first  part, 
when  completed,  is  an  assignment  sheet  outlining  the  division  of  your 
program  system  or  contract  into  program  data  points  as  defined  below. 

A  short  description  of  each  program  corresponding  to  a  data  point  is 
also  requested.  The  six  remaining  parts  are  questions  concerning  some 
sixty-five  factors  that  affect  the  cost  of  computer  programs. 

These  factors  have  been  organized  into  the  following  six  parts. 

.  Operational  Requirements  and  Design 

.  Program  Design  and  Production 

.  Data  Processing  Equipment 

.  Programming  Personnel 

.  Management  Procedures 

.  Development  Environment 


Generally speaking, the first two categories address the question,
"What  was  the  job  to  be  done?"  The  next  two  ask,  "What  were  the 
available  resources?"  and  the  last  two  examine,  "What  was  the  nature 
of  the  working  environment?"  Some  of  the  factors  may  be  specified 
or  estimated  readily  by  you,  whereas  many  required  that  we  develop 
arbitrary  rules  and  definitions  (since  there  are  no  standards), 
before  these  factors  could  be  used.  After  each  of  the  six  categories 
of questions is a general question soliciting comments. Here we would
be  especially  interested  in  any  historical  data  that  might  have  impact 
on  the  answers  provided. 

The  information  we  are  seeking  is  fairly  detailed  and  most  likely  will 
not be readily available. Therefore, since some effort will be involved
in compiling these data, we have attempted to make the questions
as  clear  and  definitive  as  possible.  Even  so,  some  of  our  definitions 
in  the  questionnaire  are  necessarily  arbitrary  and  in  some  cases  may 
be  difficult  to  apply.  We  encourage  answering  all  questions  even  if 
you  have  to  redefine  terms  to  suit  the  information  available  to  you. 
When  you  find  this  to  be  necessary,  please  help  us  by  giving  a  brief 
rationale  for  this  change. 
One problem in collecting data on computer programs is the definition
of a program in terms of bounds on the program being examined. The
definition leads us to the concept of a data point. We require the
concept of a data point to standardize the definition of a program in
order to better understand what it is we are trying to compare in our
final analysis. The answers to this questionnaire will then allow
the comparisons to be made on a more rigorous basis. One complete
questionnaire is required for each program corresponding to a data
point. We will need your help in identifying data points in accordance
with the following definition.

A  program  data  point  is  the  smallest  set  of  computer  program  instructions 

(1)  whose  purpose  is  defined  by  someone  other  than  the  programmer, 

(2)  which  is  delivered  to  the  user  (customer)  as  a  package,  and 

(3)  which  is  loaded  into  the  computer  as  a  program  unit  or  system 

to  achieve  the  stated  purpose  or  objective. 

By  this  definition,  a  program  data  point  can  be  an  operational  program, 
a  utility  program,  or  even  an  experimental  program.  These  are  clearly 
not  limited  to  any  specific  function.  Similarly,  the  user  of  the  program 
(represented  by  the  data  point)  may  be  the  buyer,  but  he  may  also  be 
another  programmer,  as  in  the  case  of  a  utility  program.  The  responder 
must  keep  in  mind  at  all  times  the  portion  of  the  program  that  he  is 
calling the program data point when answering the questions. For
example, a program data point as defined here could be a specific
package in SATIN* or a part of a model in SAGE* (e.g., Model 9, D.C.),
or a phase in NORAD, an independent system such as ECAPS* in DODDAC
or a subsystem in SACCS.*

Additionally,  the  definition  of  a  program  data  point  necessarily 
includes  some  clear  statement  of  limits  to  the  scope  of  activities 
considered as part of the programming process. Here, we are concerned
with the activities of program design, code, test, and
documentation.

A  summary  form  is  included  to  summarize  the  major  costs  of  the  program 
being  examined  in  terms  of  man  months,  computer  hours,  and  calendar  time 
involved.  Requested  on  this  sheet,  also,  is  a  list  of  names  of  the 
persons  to  whom  the  various  parts  are  delegated.  A  summary  form  is 
attached  to  each  questionnaire. 

Finally,  we  need  your  evaluation  of  the  accuracy  of  the  data 
presented.  After  each  answer  for  which  we  require  this  evaluation, 
you  will  find  an  open  parenthesis.  By  keeping  the  following  table 
handy,  you  may  conveniently  fill  in  the  parenthesis  with  one  of  the 
code numbers.


* SATIN—SAGE Air Traffic Integration
  SAGE—Semi-Automatic Ground Environment
  ECAPS—Emergency Capability System
  SACCS—Strategic Air Command Control System
TABLE FOR ACCURACY VALUES

(to be inserted in open parenthesis as indicated in questionnaire)

From Records         From Memory                Judgment

1. Very accurate     4. Accurate recollection   7. Confident
2. Good estimate     5. Good guess              8. Good guess
3. Unreliable        6. Very hazy               9. Estimate

Your cooperation will be greatly appreciated. If there are any
questions at all, please call L. Farr on Extension 439 in Santa
Monica.
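
For anyone tabulating questionnaire responses, the nine accuracy codes could be held in a simple lookup, as in the sketch below. The names `Source`, `ACCURACY_CODES` and `describe` are hypothetical, and the sketch assumes the codes run 1 through 9 across the three columns of the table.

```python
from enum import Enum

class Source(Enum):
    RECORDS = "From Records"
    MEMORY = "From Memory"
    JUDGMENT = "Judgment"

# Code number -> (basis of the answer, stated accuracy), following the table.
ACCURACY_CODES = {
    1: (Source.RECORDS, "Very accurate"),
    2: (Source.RECORDS, "Good estimate"),
    3: (Source.RECORDS, "Unreliable"),
    4: (Source.MEMORY, "Accurate recollection"),
    5: (Source.MEMORY, "Good guess"),
    6: (Source.MEMORY, "Very hazy"),
    7: (Source.JUDGMENT, "Confident"),
    8: (Source.JUDGMENT, "Good guess"),
    9: (Source.JUDGMENT, "Estimate"),
}

def describe(code):
    """Return a readable label for one accuracy code."""
    source, label = ACCURACY_CODES[code]
    return f"{source.value}: {label}"

print(describe(6))
```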
Decision Making - Make a logical choice given certain conditions (e.g., automatic w

against the coding specifications.
This activity is also called
debugging or subprogram checkout.
APPENDIX II—DEFINITION AND CODING OF VARIABLES


INTRODUCTION 

This Appendix defines the independent (predictor) variables and the
dependent (cost) variables for which data were collected by means of
the questionnaire (Appendix I). The first column indicates the source
question in the questionnaire that requests some measure on the variable.
The second column is a brief description of the variable, and the third column
identifies the variable by a number for data processing purposes. The
last column shows how the response to the question was coded in the event
a nonquantitative answer was required.
APPENDIX III—DATA MATRIX


INTRODUCTION 


Data  collection  was  conducted  by  means  of  the  questionnaire  in  Appendix  I. 
Each  questionnaire,  of  which  twenty-seven  were  completed,  serves  as  a  "data 
point."*  The  responses  to  the  questionnaire  are  reported  in  this  Appendix 
in a matrix form. The data may differ from those in the completed questionnaire
for the following reasons:

(1)  Data  rounding  and  scaling. 

(2)  Transformation  to  percentages  or  ratios. 

(3)  Transformation  to  logarithms. 

(4)  Modification  as  a  result  of  a  conversation  with  the  responder. 

(5)  Omissions,  where  guesses  were  not  made,  were  estimated  by  the 
researchers. 

This last point, (5), deserves additional comment. The computer program
used for the regression analysis is not designed to handle missing
data. Therefore, we used our judgment and experience to estimate the
missing values. These estimated values are identified by parentheses
in the data matrix.
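
The parenthesis convention can be read mechanically. The following is a minimal sketch, not part of the original analysis: the function name `parse_cell` is hypothetical, and treating a lone "-" as "no entry" is an assumption, since the text does not define the dashes that appear in the matrix. The sample row is data point 12 for variables 11 through 17 as printed.

```python
def parse_cell(cell: str):
    """Return (value, estimated) for one data-matrix cell.

    A value wrapped in parentheses, e.g. "(4.60)", was estimated by the
    researchers. A lone "-" is assumed here to mean "no entry".
    """
    text = cell.strip()
    if text == "-":
        return None, False
    estimated = text.startswith("(") and text.endswith(")")
    if estimated:
        text = text[1:-1]
    return float(text), estimated


# Data point 12, variables 11-17, as printed in the matrix.
row = ["17", "100", "0", "2.60", "0", "0", "(4.60)", "2.12"]
parsed = [parse_cell(c) for c in row]

# Positions (0-based) of the cells that were estimated rather than reported.
estimated_columns = [i for i, (_, est) in enumerate(parsed) if est]
```

Keeping the flag next to the value makes it possible to rerun an analysis with and without the estimated cells.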

The row headings in Appendix III identify the "data points" or programs
being studied, and the column headings are highly abbreviated descriptions
of the variables. Appendix II, a more complete definition of the variables
and their associated coding, includes (1) the source question in the
questionnaire (Appendix I), (2) the variable number, and (3) the coding
for the variables for use in the statistical analysis performed by the
computer programs. Variables eliminated before the first computer run
have no variable number assigned.


*A  "data  point"  is  the  smallest  set  of  instructions: 

(1) whose purpose is defined by someone other than the programmer,

(2) which is delivered to the user as a package, and

(3) which is loaded into the computer as a program unit or system to
achieve the stated purpose or objective.
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0 

3 

1 

2 

1 

1 

3 

Data
Point

 18      1     1     1     1     2     5     1     0    14     3
 19      1     1     1     1     2    12     1     1     1     2
 20      0     1     0     1     2    10     1     4    34     5
 21      0     1     0     0     2     0     0     1    34     2
 22      1     1     1     1     3     4     2     5    34     4
 23      0     1     1     1     3     5     1     5    34     4
 24      0     1     1     0     3     5     2     5    34   2.5
 25      1     1     0     0     3     4     2     5    34   2.5
 26      0     0     1     1     1   (0)   (0)   (0)     1     3
 27      0     0     1     1     1     5     2     6     1     3
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o  Complexity 


Columns, by variable number (headings that are not legible here are
given as in Appendix VI):

  11    Estimated Instructions
  12    % New Instructions
  --    % Reused Instr. (no variable number)
  13    Log, No. Words in Tables, Constants
  14    % Subroutine Instructions
  15    % Discarded Instructions
  16    Log, No. Words Data Base
  17    Log, No. Classes in Data Base

Data
Point     11      12      --      13      14      15      16      17

  1       41     100       0    4.83       9       0    0.00    0.00
  2       66     100       0    4.50      10       0    0.00    0.00
  3       71      12      88    4.60      11       0    0.00    0.00
  4      238      97       3    7.81       7      53    8.19    3.76
  5      150      92       8    0.00       0      33    0.00    0.00
  6      270      96       4    7.08       3       0    7.68    3.91
  7       20      60      40    4.28       0      58    3.64    0.85
  8       28      80      20    4.32       0       0    3.66    0.84
  9       16      40      60    4.34       0       0    3.70    0.85
 10       14     100       0    3.56       0      30    0.00    0.00
 11      4-3     100       0    1.00       0       0    0.00    0.00
 12       17     100       0    2.60       0       0  (4.60)    2.12
 13       15     100       0    3.48       0       6    6.04    2.01
 14       40      47      53    4.40      19       0    5.98    2.02
 15       60     100       0    4.30       0     300    2.18    0.30
 16       10      48      52    0.00       0       0    4.00    1.30
 17       15      89       0    0.00      11       0    0.00    0.00
 18       14      62      38    2.90      35      30    0.00    0.00
 19      151      95       5    4.17      73      59    3.48    0.61
 20       60     100       0    4.10       0      73    3.54    0.60
 21       45     100       0    3.70       0      20    0.00    0.00
 22       16      59      41    4.93       0       0    1.40    0.78
 23       30      92       8    4.08       0       0    2.40    0.78
 24       15     100       0    3.60       0       0    2.18    0.70
 25       12     100       0    4.00       0       0    2.18    0.78
 26        4     100       0    3.30       0      10    0.00    0.00
 27        7     100       0    3.30       0      10    5.60    1.48

Columns, by variable number (headings that are not legible here are
given as in Appendix VI):

  18    No. Input Messages
  19    No. Output Messages
  20    % Instr. Requiring Innovation
  21    No. Subprograms
  22    Maintainability
  --    Usability (no variable number)
  23    Clerical Instr.
  24    % Data Reduction Instr.
  25    % Prediction Instr.

Data
Point     18     19     20     21     22     --     23     24     25

  1        0      0     21      3      1      1     20     58      2
  2        2      3      0     27      1      1     19     11     45
  3        2      3     84     30      1      1     20     25     45
  4       73     80     59     62      1      1     27     23     15
  5        0      0     27      2      1      1      5     20      5
  6       49     85    100    120      1      1     10     20     20
  7        3      0     30     35      1      1     28     17     25
  8        3      6     41     35      1      1     40     15     15
  9        3      6     87     31      1      1     35     15     15
 10       21     14     74      5      0      0     10      0      0
 11        0      0      0     22      0      0     30     40     10
 12        5     18    100     13      1      1     50     40      0
 13        9     22      0     16      0      0     10     60     10
 14       10     50      0     20      0      0     30     60      0
 15        1      1    100      5      1      1     75      0      0
 16        4      8      0     21      0      0     70     15      0
 17        8      6      0     12      1      1     50     50      0
 18        0      0     35     19      1      1     60     20      0
 19       16     11     24     23      1      1      5     50      5
 20        6      9     15     21      1      1     15      5     25
 21        0      0    100     30      1      1     40     20      0
 22        4      0     30     16      1      1     10     20     30
 23        3      0     10     18      1      1     10     60     10
 24        4      0     51      9      1      1     10     20     30
 25        6      9     35     13      1      1     10     60     10
 26        0      1      0     14      0      0     80     20      0
 27        2      3      0     16      1      0     50     20      0
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Variable  Number 

  26    % Decision Making Instr.
  27    Insufficient Memory
  28    Insufficient I/O
  29    Timing Constraint
  30    No. of Program Changes
  31    Time Peak Program Changes
  32    Language Type
  33    No. of Program Tools

Data
Point     26     27     28     29     30     31     32     33

  1       20      0      0      0      3      2    1.0      6
  2       25      1      1      1   (20)      3    1.0     16
  3       10      1      0      1    (0)      3    1.0     17
  4       35      1      1      0    136      3    1.0     19
  5       70      0      1      1     50      1    1.0      8
  6       50      1      1      1    100      2    1.0     17
  7       30      0      1      1     30      3    2.0     17
  8       30      0      1      1     70      3    2.0     17
  9       35      0      1      1     45      3    2.0     17
 10       90      0      1      1     25      3    2.0     11
 11       20      0      0      0      0      0    2.0      5
 12       10      0      1      0     50      2    1.0     10
 13       20      1      1      1     15      1    2.0     15
 14       10      1      1      0     50      2    1.0     12
 15       25      0      0      0      2      3    1.0     13
 16       15      1      1      1     20      3    1.0   (10)
 17        0      0      0      0    (0)    (0)    1.0      9
 18       20      0      1      0    (5)      1    1.5     11
 19       40      1      0      1      7      2    1.0     16
 20       55      1      1      1    170      2    2.0     14
 21       40      0      0      0    (0)    (0)    2.0      2
 22       40      1      0      1      5      1    2.0      9
 23       20      1      0      1      5      1    2.0      6
 24       40      1      0      0      4      1    2.0      6
 25       20      1      0      0      5      1    2.0      6
 26        0      0      0      0      4      2    2.0      7
 27       30      1      0      0     20      1    2.0      7
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Variable  Number-1 

Columns, by variable number (headings that are not legible here are
given as in Appendix VI):

  34    Parameter Test Req'ts
  35    Assembly Test Req'ts
  X     [heading not legible]
  36    Req'ts Specify Inputs, Outputs
  37    Req'ts Specify When Stop Testing
  38    No. Internal Document Types
  39    No. External Document Types

Data
Point     34     35      X     36     37     38     39

  1        0      0      0      0      0      1      1
  2        1      1      2      1      1      1      9
  3        0      1      2      1      1      1      5
  4        1      1      2      1      1     10     13
  5        0      0      2      0      0      5      5
  6        1      1      2      1      1      8     19
  7        0      1      0      0      0      1      9
  8        0      0      0      0      0      1      9
  9        0      1      2      0      0      1      9
 10        0      1      2      1      0      0      6
 11        1      1      2      0      1      4      5
 12        0      1      2      1      1      5      7
 13        0      1      2      1      1      4      2
 14        0      1      2      1      0      3      5
 15        1      1      2      1      1      1      5
 16        0      1      2      0      0    (2)      2
 17        1      1      2      0      0      4      3
 18        0      1      2      0      1      5      3
 19        1      1      2      0      0     11      3
 20        0      0      0      0      0      5      6
 21        1      0      1      0      0      6      3
 22        1      1      2      1      0      1      4
 23        1      1      2      1      0      1      3
 24        1      1      1      1      0      1      5
 25        1      1      1      1      0      0      5
 26        1      1      2      1      0     11      2
 27        1      1      2      1      0     11      2
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icument 


Columns, by variable number (headings that are not legible here are
given as in Appendix VI):

  X     [heading not legible]
  40    Computer Hours/Week
  41    Computer Time Adequate, Parameter Test
  --    Computer Time Adequate, Assembly Test (no variable number legible)
  X     [heading not legible]
  42    Computer Oper. Adeq'. Documented
  43    Computer Design Interrupt
  44    Core Size (thousands)

Data
Point      X     40     41     --      X     42     43     44

  1        1      7      1      1      5      1      0     32
  2        1     50      1      0     15      1      0     32
  3        1     12      1      1     15      1      0     32
  4        2    150      0      0      -      0      1     65
  5        2     25      1      1     28      1      1     65
  6        2     80      0      0      -      0      1     65
  7        1     25      0      0     24      1      1     69
  8        1     35      1      1     24      1      1     69
  9        1     45      1      1     24      1      0     69
 10        2      9      0      1     10      0      0     32
 11        2      4      1      1      8      1      1     32
 12        2      4      1      1      8      1      1     32
 13        2     30      0      1      -      1      0     32
 14        2     13      0      0      7      0      1     32
 15        2      3      0      1      -      0      1     32
 16        -     18      0      0      -    (0)      1     16
 17        2     20      1      1      -      1      1     32
 18       14     17      1      1      -      1      0     69
 19        1     49      1      1      -      1      0     32
 20       34     31      1      1     10      0      1     24
 21       34     17      1      1     20      1      1     24
 22       34      5      1      1     10      1      1     24
 23       34      5      1      1     10      1      1     24
 24       34     13      1      1     10      0      0     24
 25       34      9      1      1     10      0      1     24
 26        1     10      1      0     12      1      1      8
 27        1     20      1      0     12      1      1      8
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Variable 

Number 

  45    No. EDP Components Developed
  46    No. of Displays
  47    Input Output Equip.
  48    Input Only Equip.
  49    Output Only Equip.
  50    Pieces EAM Equip.
  51    EAM Adequate
  X     [heading not legible]
  --    % Time Lost, Peripheral (no variable number legible)

Data
Point     45     46     47     48     49     50     51      X     --

  1        0      0     14      1      2      5      1
  2        2      1     21      0      2      5      1      5     10
  3        2      1     21      0      2      5      1      -      -
  4        0      0     11    (0)    (0)    (5)      1      -      -
  5        0      0      2      2      2    (5)      0      5     35
  6        0      0    (2)    (2)    (2)    (5)      1      -      -
  7        4      6     41      8     13      5      1      5      5
  8        0      6     41      8     13      5      1      5      5
  9        0      6     41      8     13      5      1      5      5
 10        1      0     15      2      3      6      0      5      5
 11        0      0      0      2      1      5      1     10      5
 12        1      0     15      2      4    (1)      0     10      5
 13        6      2     15      1      0      1      0      5     15
 14        2      2     13      2      2      5      1     15      1
 15        0      0     12      1      2      6      1      -      -
 16        0      1      7      1      1    (5)      1      -      -
 17        0      1     12      1      2    (5)      1      -      -
 18        0      0      6      1      2    (5)      1      -      -
 19        5     26     36      0     30      2      0      -      -
 20        3     14      7      8      8     12      1      2      2
 21      (1)      0      2      1      1      7      1      -      -
 22        0      0      2      4      2      7      1      2      2
 23        0      0      2      4      2      7      1      2      2
 24        0      0      2      4      2      7      1      2      2
 25        0      0      2      4      2      7      1      2      2
 26       16      0     10      2      2      2      0     10     20
 27       16      8     12      5      4      2      0     10     25


Columns, by variable number (headings are not legible here and are
given as in Appendix VI):

  52    % Type I Programmers
  53    % Type II Programmers
  54    % Type III Programmers
  55    % Type IV Programmers
  56    Type I Programmer Experience
  57    Type II Programmer Experience
  58    Type III Programmer Experience
  59    Type IV Programmer Experience

Data
Point     52     53     54     55     56     57     58     59

  1        0     33     67      0    0.0    1.0    4.0    0.0
  2       14     29     38     19    0.0    4.0    7.0    8.5
  3        0     33     50     17    0.0    3.0    7.5   11.5
  4       40     44     10      6    1.5    5.5    6.5    6.5
  5        6     14     37     43    5.0    5.0    5.0    5.0
  6       25     59     15      1    3.0    6.0   15.0   15.0
  7       28     40     24      8    0.0    2.0    5.5    7.0
  8       16     39     39      6    0.0    2.0    6.0    7.5
  9        9     41     41      9    0.0    2.5    6.5    8.0
 10       33     34     33      0    0.0    0.0    0.0    0.0
 11        0     70     26      4    0.0    2.5    4.0    5.0
 12        0     50     30     20    0.0    3.0    4.0    5.0
 13        0     90     10      0    0.0    0.0    0.0    0.0
 14       17     83     00      0    1.3    6.0    0.0    0.0
 15       25     38     25     12    2.0    3.0    4.0    5.0
 16        0     62     25     13    0.0    4.0    4.0    4.0
 17       18     36     27     19    1.0    2.0    5.0    6.0
 18        0     67     33      0    0.0    3.5   15.0    0.0
 19        0     87     13      0    0.0    8.0   12.0    0.0
 20       15     70     15      0    0.0    3.0    4.0    0.0
 21        8     50     34      8    0.0    0.0    0.0    0.0
 22       12     50     25     13    2.3    2.3    2.3    2.3
 23       12     50     25     13    2.3    2.3    2.3    2.3
 24       26     59      8      7    0.0    0.0    0.0    3.0
 25       26     59      8      7    0.0    0.0    0.0    3.0
 26       13     37     37     13    2.0    3.0    4.0    5.0
 27       12     35     41     12    2.0    3.0    4.0    5.0


Columns, by variable number (headings are not legible here and are
given as in Appendix VI):

  60    % Programmers in Ops Design
  61    % Programmers in Ops and Prg Design
  62    % Programmers in Program Design
  63    % Programmers in Whole Job
  64    No. Terminations per Month
  65    No. Hires per Month

Data
Point     60     61     62     63     64     65

  1       33    100    100     33      0      0
  2       60     75     70     45      0      5
  3       17    100     67     50      0      2
  4       29     65     88     72      3      4
  5      100    100    100     60      2      4
  6        7     67     20     55      2      5
  7        0      0     52     84      2      2
  8        0      0     45     61
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APPENDIX  IV— FREQUENCY  COUNT  OF  ACCURACY  RESPONSES 


INTRODUCTION 

In  an  attempt  to  determine  the  accuracy  with  which  questions  were  answered 
in  the  questionnaire,  the  following  accuracy  index  was  devised: 


TABLE FOR ACCURACY VALUES


From Records           From Memory                 Judgment

1.  Very accurate      4.  Accurate recollection   7.  Confident

2.  Good estimate      5.  Good guess              8.  Good guess

3.  Unreliable         6.  Very hazy               9.  Estimate

Appendix IV is a frequency count of these responses for those variables
that were specifically tagged for this additional information. Numbers
in parentheses refer to the question number in the event no variable
number was assigned.
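
As an illustration of how such a frequency count can be tallied, here is a minimal sketch. The codes and labels follow the accuracy index above; the list of responses is hypothetical and is not data from the study.

```python
from collections import Counter

# Accuracy index: code -> (source of the answer, stated reliability).
ACCURACY_INDEX = {
    1: ("From Records", "Very accurate"),
    2: ("From Records", "Good estimate"),
    3: ("From Records", "Unreliable"),
    4: ("From Memory", "Accurate recollection"),
    5: ("From Memory", "Good guess"),
    6: ("From Memory", "Very hazy"),
    7: ("Judgment", "Confident"),
    8: ("Judgment", "Good guess"),
    9: ("Judgment", "Estimate"),
}

# Hypothetical accuracy codes reported for one variable across questionnaires.
responses = [1, 2, 2, 5, 8, 8, 8, 9]

# Frequency of each code, and of each source category.
counts = Counter(responses)
by_source = Counter(ACCURACY_INDEX[code][0] for code in responses)
```

Grouping by source shows at a glance whether a variable was answered mostly from records, from memory, or by judgment.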
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FREQUENCY 
COUNT OF ACCURACY RESPONSES
(Column numbers refer to Accuracy Index Table, page 101)

Variables
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APPENDIX  V— VALIDITY  TABLES 


INTRODUCTION 

These  tables  present  the  correlations  of  all  predictor  variables  with 
costs.  The  first  table  presents  the  correlations  for  a  sample  size  of 
N=26  and  the  second  table  presents  the  recomputed  correlations  for  the 
analysis  with  all  the  extremely  large  data  points  removed  (N=24).  The 
reader  will  note  a  significant  change  in  the  values  of  the  correlation 
coefficients. 

While  the  cost  variables  are  defined  in  the  column  headings,  for  economy 
of  space,  the  machine  variables  are  not  defined  in  these  tables.  A 
complete  definition  of  all  variables  will  be  found  in  Appendix  II. 

These  tables  have  also  been  referred  to  in  the  text  as  the  correlation 
matrix.  The  decimal  points  have  all  been  omitted,  but  the  values  are, 
of  course,  in  hundredths. 
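
To make the printing convention concrete, the following sketch converts printed entries to correlation coefficients and shows the usual product-moment formula. It is an illustration only: the helper names are hypothetical, and the text does not say which formula or program produced the tables, so the Pearson coefficient here is an assumption.

```python
import math


def to_coefficient(entry: str) -> float:
    """Convert a printed entry such as "-26" or "07" to -0.26 or 0.07."""
    return int(entry) / 100


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# A few entries as they are printed in the validity tables.
printed = ["-26", "07", "37", "-42"]
coefficients = [to_coefficient(e) for e in printed]
```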
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-26 

-28 

01 

-24 

07 

-10 

-13 

-l6 

-25 

-26 

37 

-01 

-18 

-21 

01 

-21 

13 

24 

-08 

-10 

26 

-11 

18 

10 

16 

-11 

-04 

-08 

38 

19 

19 

30 

-42 

19 

20 

14 

09 

09 

02 

19 

-03 

-02 

l4 

12 

19 

11 

39 

37 

37 

4o 

37 

21 

30 

17 

36 

31 

27 

49 

04 

53 

46 

17 

36 

39 

4o 

69 

72 

83 

01 

54 

53 

55 

53 

50 

50 

77 

04 

68 

71 

27 

67 

58 

la 

12 

17 

09 

-10 

18 

18 

23 

-02 

11 

-38 

-10 

-62 

16 

05 

08 

13 

-04 

k2 

-13 

-07 

10 

-59 

-21 

12 

28 

-32 

-27 

-06 

-03 

-32 

-03 

16 

-38 

-15 

-29 
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VALIDITY  TABLE 
N = 24

Columns are cost variables 84 through 99; the descriptive column
headings are not legible. Rows are predictor variables 43 through 83.
Decimal points are omitted.

Var.
No.    84   85   86   87   88   89   90   91   92   93   94   95   96   97   98   99

 43   -06  -08  -09  -12   06  -16  -36   04   03  -12  -10   21  -12  -15   14  -04
 44    05   19   36  -11  -03  -14   13   11   03   15   29   00   11   40  -11   05
 45    16   14   29  -34   11   38   06   14   09   22   39  -03   10   29   13   17
 46    68   76   86   05   63   15   54   59   52   28   75   10   32   37   42   68
 47    28   40   67  -08   13   17   43   24   12   31   65   08   28   50  -07   27
 48    22   38   48   02   33  -11  -26   41   35   07   39  -12   23   33   37   26
 49    51   64   82   04   39   06   49   39   28   16   70   01   20   31   12   49
 50    31   33   01   45   53  -03  -08   46   53  -15   05  -10   29  -09   56   35
 51    01   02  -13   11   12  -07   05   09   13  -27  -11  -09   12  -01   10   02
 52   -02  -04  -07   34   01   06  -41   10   06  -09   18   05   08   03   07  -01
 53    27   30   29   10   22  -23   10   26   18   40   08   40  -11  -11   25   27
 54   -13  -14  -08  -22  -09   16  -08  -19  -12  -25  -05  -37   10   22  -19  -14
 55   -19  -28  -28  -21  -25   23  -03  -28  -21  -17  -17  -08   10  -00  -26  -21
 56   -15  -19  -29   12  -17  -22  -21  -11  -18  -19  -22   30  -20  -24  -09  -16
 57    47   39   42   05   31   15   58   34   29   18   42   35   24   23   18   43
 58   -10  -13   06  -63  -07   30  -12  -11  -08  -23   12  -34  -02   31  -05  -09
 59   -10  -14   03  -27  -20   26   20  -13  -12  -13   11  -19   19   26  -22  -11
 60    32   17  -03   05   39   17   05   39   48   12   17   13   41   34   47   33
 61    14   08   03   04   12  -13   23   19   18   29   04   43   10   06   19   14
 62    04   08  -24   49   17  -13  -15   14   16  -10  -09   09  -02  -23   20   06
 63   -11  -06   01  -28  -13  -03  -18  -18  -15  -25  -04  -39  -14  -08  -12  -10
 64    07   23   48   05  -07  -19   01   07  -06   24   24   10   02   21  -13   06
 65    34   22   11   33   23   24   18   28   34   22   15  -06   42   13   33   34
 66   -14  -20  -37   25  -22   18  -09  -29  -25  -33  -20  -19  -14  -11  -22  -15
 67   -07  -17  -33  -07   00   15  -22   00   10  -23  -15  -18   12   04   16  -05
 68   -24  -14  -28   13  -18  -25  -28  -28  -25  -06  -41  -04  -26  -42  -19  -23
 69    04   04   04  -20  -02  -15   26  -15  -10  -07  -16   02  -15  -14  -16   01
 70   -51  -48  -57  -36  -33  -23  -43  -46  -38  -61  -56  -26  -44  -34  -30  -50
 71   -22  -24  -15  -49  -26   11  -06  -29  -25  -30  -14  -19   01   19  -32  -23
 72   -40  -45  -48   02  -42  -45  -22  -36  -37  -08  -55   19  -42  -55  -27  -41
 73    12   20   29   00   12  -17  -23   18   09   15   10   18  -00   01   14   13
 74    25   11   16   06   16   02  -00   32   30   54   10   59   34   30   26   24
 75    08   19   17   15   12  -42  -10   18   10   15  -03   18  -06  -06   11   09
 76    06   05  -12   08   07   08  -07   01   10  -24  -04  -34   01  -02   11   07
 77    05   12  -03   16   11  -26  -20   06   09  -18  -08  -15  -14  -20   12   06
 78    06   12  -03   15   13  -23  -19   06   10  -27  -07  -21  -13  -19   12   07
 79    00  -20  -11   02  -19   38  -05  -10  -09   31   00   16   12   15  -03  -02
 80   -12  -22   10  -43  -29   10   25  -14  -21   19   17   24  -02   31  -29  -16
 81    10   19   31  -07   10  -27  -27   15   01   15   08   37  -15  -13   09   11
 82   -14  -11  -12  -00  -01  -16   08  -09  -08  -22  -10  -17  -14  -20  -05  -13
 83    68   59   44   23   69   47   17   77   74   67   56   34   70   45   71   70

* 

rH 

CJ\ 

w* 

100 

02 

14 

18 

63 

31 

42 

44 

4l 

07 

12 

24 

-17 

-28 

-13 

36 

-08 

-10 

37 

18 

11 

-1 6 

10 

27 

-28 

-02 

-30 

-16 

-49 

-28 

-40 

18 

30 

15 

00 

04 

04 

-09 

-10 

l4 

-09 

76 
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APPENDIX  VI 

FACTOR  ANALYSIS  OF  PREDICTOR  VARIABLES 

Factor  Loadings  for  First  Six  Computed  Factors 
After  Rotation  by  Varimax  Method  (n=26) 


                                            FACTOR COEFFICIENTS
Var.
No.   Short Description                     I     II    III     IV      V     VI

  1   New System                           19     34    -20    -35     21    -44
  2   New Hardware                        -20    -05    -24     59     27     03
  3   Partic. in Reqt. Anal.              -14     22     20    -15     50    -13
  4   Partic. in Oper. Design              25     48     15    -12    -01    -44
  5   How Well Reqts. Known               -04     08    -45     02     02     27
  6   No. System Changes                   42     66    -20    -14    -01     27
  7   Time of Peak Changes                 32     15    -32     10    -05    -09
  8   No. Commands                         28     03    -38    -14     47     37
  9   No. ADP Centers                     -20    -39    -47     41     50     26
 10   Complexity                           64    -07    -31    -03     09    -14
 11   Estimated Instructions               79     04     30    -01     25    -16
 12   % New Instructions                   08    -28     24     25     09     20
 13   Table, Constants, Words              67     18    -27     05     31    -21
 14   % Subroutine Instructions           -09     38     45     36     28    -46
 15   % Discarded Instructions             08    -04    -02    -03     03    -34
 16   Data Base Words                      82     15     16     21    -09     13
 17   Data Base Classes                    88    -16     15     02    -13     00
 18   No. Input Messages                   88    -07     09     11     02    -11
 19   No. Output Messages                  92    -08     12     16    -10    -09
 20   % Instruction Innovation             30     01    -24    -24     13    -18
 21   No. Subprograms                      82     13     03    -17     15     08
 22   Maintainability                      07     16    -26    -26     70    -08
 23   Clerical Instructions               -23    -14     35    -47    -32    -06
 24   % Data Reduction Instructions       -16    -20     23     37     01     05
 25   % Prediction Instructions            21     14    -43    -26     36    -02
 26   % Decision Instructions              27     23    -33     34     09    -04
 27   Insufficient Memory                  36    -19     01     37     37    -08
 28   Insufficient I/O                     52     42    -18     06    -43    -10
 29   Timing Constraint                    16     53    -28     02     16    -06
 30   No. Program Changes                  73     25    -16     14    -03     11
 31   Time - Peak Program Changes          33     52    -16    -28    -18    -37
 32   Language Type                       -29     03    -21     12     09     69
 33   No. Program Tools                    61     60    -04    -17     00    -27
 34   Parameter Test Reqts.                11    -53     26    -07     48     18
 35   Assembly Test Reqts.                 20    -30     29    -15     06    -18

NOTE:  Decimal  points  are  omitted  before  each  coefficient. 
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FACTOR  COEFFICIENTS 


Var.
No.   Short Description                     I     II    III     IV      V     VI

 36   Reqts. Specify I/O                   33    -54     09    -05     15    -16
 37   Reqts. for Stop Testing              42    -22     16    -31    -12    -38
 38   No. Internal Documents               30     00     83     07     15     10
 39   No. External Documents               86     19    -21    -25     08     06
 40   Computer Hours/Week                  82     20     11    -04     08    -04
 41   Computer Adeq. Parameter Test       -57     04     04    -17     48     19
 42   Computer Operation Documented       -54     28     31    -40     01     12
 43   Computer Design Interrupted          23    -36     03    -12    -03     48
 44   Core Size                            49     53    -26    -22    -04    -04
 45   No. EDP Components                  -14     17     70     03     01     20
 46   No. Displays                        -03     67     34     37     32     04
 47   I/O Equipments                      -01     89     06    -25    -07     03
 48   Input Equipments                    -03     45    -40    -06     16     61
 49   Output Equipments                   -07     81     23     19     31     07
 50   Pieces EAM Equipment                 01    -14    -79     23     20     03
 51   EAM Adequate                         11    -10    -71    -17     09    -01
 52   % Type I Programmers                 54    -18    -33     00     05     12
 53   % Type II Programmers                13     02     22     78    -03     09
 54   % Type III Programmers              -49     27     02    -39    -12    -02
 55   % Type IV Programmers               -20    -32     03    -69     16     01
 56   Type I Programmer Experience         51    -38    -11    -05     33    -15
 57   Type II Programmer Experience        51     28     22     12     13    -42
 58   Type III Programmer Experience      -03     06     71    -36     19     08
 59   Type IV Programmer Experience        50     08    -02    -74     14     07
 60   % Programmers in Ops Design         -08    -10    -15     04     17    -59
 61   % Programmers in Ops and Prg Design  05     07     00     14     01    -70
 62   % Programmers in Program Design     -21    -19    -41     47     23    -48
 63   % Programmers in Whole Job          -08    -17     28    -26     55     48
 64   No. Terminations per Month           58     42    -07    -13    -01     41
 65   No. Hires per Month                  45    -09    -12    -20     08     18
 66   System Design Documented             14    -43     22     04     40    -35
 67   Program Design Documented            19    -53     01    -29     57    -19
 68   Computer Use Documented              03    -55    -23     25     28     29
 69   Unavailable Computer Documented      01    -12     37     05    -04    -03
 70   Communicating Agency Documented     -20    -60     02    -21     43    -05
 71   Concurring Agency Documented        -16    -10     07    -78    -02     09
 72   Cost Control Documented              02    -62    -14    -07     06     03
 73   Management Control Documented        28    -05     19     04     54     31
 74   Document Control Documented          45     07    -10    -03    -52    -20
 75   Standards Documented                 21     08    -18     21     49    -07
 76   No. Concurring Agencies              27    -22    -01     19     71    -23
 77   No. Experienced Agencies            -03    -19    -17     46     73    -07
 78   No. Decision Agencies                15    -17    -17     38     73    -08
 79   Schedule Slipped                    -04    -17     46     00    -50     08
 80   Computer Operated by Other           19     38     46    -39    -50    -32
 81   Program Developed at Other Site      25    -12     19     17     13     68
 82   Program Developed at Several Sites  -32     21    -06    -02    -12    -12
 83   Trip Mileage                         88    -03     02    -01     04    -05
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SUMMARY  OF  REGRESSION  ANALYSES 
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APPENDIX  VII— SUMMARY  OF  CORRELATION  AND  REGRESSION  ANALYSES 


INTRODUCTION 


The  following  eight  tables  are  summaries  of  the  results  of  the  correlation 
and  regression  analyses.  Each  potential  predictor  variable  considered  in 
the  final  regression  analysis  is  listed  with  a  short  description  of  the 
variable  (a  more  complete  description  will  be  found  in  Appendix  II). 
Various  statistical  relationships  of  the  variables  are  presented  as  well 
as  the  final  regression  equation. 


TABLE 1

SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 84

Number of Man Months to Design, Code and Test


Variable Number              11        10        39        38        16        26        64

Short Description            Est.      Complex-  Ext.      Int.      D/B       %         Terms.
                             Instr.    ity       Docts.    Docts.    Words     Decis.    Per
                             (1000's)                                (Log10)   Instr.    Month

Means                        54.8      3.2       5.6       3.8       3.0       28.1      .8

Standard Deviations          66.2      .9        4.0       3.6       2.2       18.8      1.0

Validity Coefficients        .89   .67   .78   .65   .36   .?3
                             [only six of the seven entries are legible;
                             their column alignment cannot be recovered]

Intercorrelations

Variable Number     11       1.00      .50       .67       .59       .58       .24       .32
                    10       .50       1.00      .55       .05       .33       .42       .09
                    39       .67       .55       1.00      .06       .56       .35       .66
                    38       .59       .05       .06       1.00      .35       -.06      .04
                    16       .58       .33       .56       .35       1.00      .11       .63
                    26       .24       .42       .35       -.06      .11       1.00      .07
                    64       .32       .09       .66       .04       .63       .07       1.00

Standardized Regression      .46       .25       .21       .12       .11       .07       .05
Coefficients (7 variables)

Standardized Regression      .45       .26       .26       .11       .12       not       not
Coefficients (5 variables)                                                     selected  selected

Multiple Correlation Coefficient           .95    Number of Data Points                 [not legible]
Mean of Cost Variable                      300    Standard Error of Estimate            138
Standard Error of Prediction at the Mean   141    Standard Deviation of Cost Variable   397
95% Confidence Limits at the Mean*         ±295 Man Months


PREDICTION EQUATION:  Y84 = 2.7X11 + 121X10 + 26X39 + 12X38 + 22X16 - 497


*These limits will expand as predictions deviate from the mean.
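
As a check on how such a prediction equation is used, the sketch below evaluates the Table 1 equation at the tabled means of its five predictors. The coefficients are as read from the printed equation, which is partly illegible in this copy, so they should be treated as a best reading; the function name is hypothetical. This follows the report's own caution that the equations are not recommended for actual planning.

```python
# Coefficients of the Table 1 prediction equation, keyed by variable number
# (11 est. instructions in thousands, 10 complexity, 39 external document
# types, 38 internal document types, 16 log10 of data base words).
COEFFICIENTS = {11: 2.7, 10: 121.0, 39: 26.0, 38: 12.0, 16: 22.0}
INTERCEPT = -497.0


def predict_man_months(x: dict) -> float:
    """Evaluate the linear prediction equation for one program."""
    return INTERCEPT + sum(b * x[var] for var, b in COEFFICIENTS.items())


# Tabled means of the five selected predictors.
means = {11: 54.8, 10: 3.2, 39: 5.6, 38: 3.8, 16: 3.0}
at_mean = predict_man_months(means)
```

Evaluated at the predictor means, the equation gives roughly 295 man months against a tabled mean of 300; a gap of that size is consistent with the rounding of the printed coefficients.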
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TABLE  2 


SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 87

Months Elapsed


Variable Number          13        44        26        39        11        64        10        16

Short Description        Wds. in                       Ext.      Est.      Term.               D/B
                         Tbls. &             %         Doc.      Inst.     Per       Complex-  Wds.
                         Consts.   Core      Decis.    Types     (thous.)  Mo.       ity       (Log10)
                         (Log10)             Instr.

Means                    3.8       35.9      28.1      5.6       54.8      .8        3.2       3.0

Standard Deviations      1.6       19.0      18.8      4.0       66.2      1.0       .9        2.2

Validity Coefficients    .62       .05       .53       .39       .32       .22       .35       .24

Intercorrelations

Variable Number   13     1.00      .49       .38       .66       .66       .44       .53       .48
                  44     .49       1.00      .17       .67       .31       .70       .23       .37
                  26     .38       .17       1.00      .35       .24       .07       .42       .11
                  39     .66       .67       .35       1.00      .67       .66       .55       .56
                  11     .66       .31       .24       .67       1.00      .32       .50       .58
                  64     .44       .70       .07       .66       .32       1.00      .09       .63
                  10     .53       .23       .42       .55       .50       .09       1.00      .33
                  16     .48       .37       .11       .56       .58       .63       .33       1.00

Standardized Regression
Coefficients (8 variables)
                         .74       -.61      .34       .29       -.27      .23       -.07      -.06

Standardized Regression
Coefficients (4 variables)
                         .59       -.40      .32       .16       not       not       not       not
                                                                 selected  selected  selected  selected

Multiple Correlation Coefficient           [not legible]   Number of Data Points                 26
Mean of Cost Variable                      16.3            Standard Error of Estimate            4.8
Standard Error of Prediction at the Mean   4.9             Standard Deviation of Cost Variable   6.8
95% Confidence Limits at the Mean*         ±10.2 Months


PREDICTION EQUATION:  Y87 = 2.5X13 - .14X44 + .11X26 + .3X39 + 7.0


*These limits will expand as predictions deviate from the mean.
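
The standardized coefficients and the prediction equation in Table 2 are linked by the standard regression identities: a raw coefficient is the standardized coefficient times the ratio of the standard deviation of the cost variable to that of the predictor, and the constant makes the equation pass through the means. The sketch below applies those identities to the tabled four-variable values. The report does not say the equation was derived this way, so this is only a consistency check under that assumption.

```python
# Table 2, four-variable solution, keyed by variable number.
beta = {13: 0.59, 44: -0.40, 26: 0.32, 39: 0.16}   # standardized coefficients
sd_x = {13: 1.6, 44: 19.0, 26: 18.8, 39: 4.0}       # predictor std. deviations
mean_x = {13: 3.8, 44: 35.9, 26: 28.1, 39: 5.6}     # predictor means
sd_y, mean_y = 6.8, 16.3                            # cost variable 87

# Raw (unstandardized) coefficient: b = beta * sd_y / sd_x.
raw = {var: beta[var] * sd_y / sd_x[var] for var in beta}

# Constant term: the fitted equation passes through the point of means.
intercept = mean_y - sum(raw[var] * mean_x[var] for var in raw)
```

The results come out near 2.5, -.14, .12 and .27, with a constant near 7.1, which agrees with the printed equation to within the rounding of the tabled inputs.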
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TABLE  3 


SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 88

Computer Hours Used


Variable Number             11        10        16        64        38        26

Short Description           Est.      Complex-  d/b       Terms.    Int.      %
                            Instr.    ity       Wds.      Per       Doc.      Decis.
                            (1000's)            (Log10)   Month     Types     Instr.

Means                       54.8      3.2       3.0       .8        3.8       28.1
Standard Deviations         66.2      .9        2.2       1.0       3.6       18.8
Validity Coefficients       .87       .70       .64       .39       .45       .36

Intercorrelations
Variable Number
    11                      1.00      .50       .58       .32       .59       .24
    10                      .50       1.00      .33       .09       .05       .42
    16                      .58       .33       1.00      .63       .35       .11
    64                      .32       .09       .63       1.00      .04       .07
    38                      .59       .05       .35       .04       1.00      -.06
    26                      .24       .42       .11       .07       -.06      1.00

Standardized Regression
Coefficients (6 variables)  .52       .37       .11       .11       .09       .07

Standardized Regression
Coefficients (3 variables)  .59       .35       .18       not       not       not
                                                          se-       se-       se-
                                                          lected    lected    lected

Multiple Correlation Coefficient            .94     Number of Data Points                 26
Mean of Cost Variable                       1482    Standard Error of Estimate            905
Standard Error of Prediction at the Mean    923     Standard Deviation of Cost Variable   2410

95% Confidence Limits at the Mean*          ±1911 Hours


PREDICTION EQUATION:  $Y_{88} = 21.5X_{11} + 985X_{10} + 197X_{16} - 3468$


*These limits will expand as predictions deviate from the mean.
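
As a rough consistency check on the Table 3 figures, the multiple correlation of a least-squares fit can be recovered from the standardized regression coefficients and the validity coefficients through the standard identity R^2 = sum(beta_i * r_iy). The sketch below applies that identity to the 3-variable solution. The function name is illustrative and does not come from the report, and reading the three selected columns as variables 11, 10, and 16 is an assumption based on the column order of the table.

```python
import math


def multiple_correlation(betas, validities):
    """Multiple R from standardized coefficients and validity coefficients.

    Relies on the least-squares identity R^2 = sum(beta_i * r_iy), where
    r_iy is the correlation of predictor i with the cost variable.
    """
    r_squared = sum(b * r for b, r in zip(betas, validities))
    return math.sqrt(r_squared)


# Table 3, 3-variable solution (variables 11, 10, 16 assumed from column order).
betas = [0.59, 0.35, 0.18]
validities = [0.87, 0.70, 0.64]

r = multiple_correlation(betas, validities)
print(f"multiple R recovered from the table: {r:.3f}")
```

Because the inputs are rounded to two decimals, the recovered value should only be expected to agree with the tabulated .94 to within rounding.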
TABLE 4


SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 90
Delivered Instructions (in thousands)


Variable Number             11        18        44        72        38        8

Short Description           Est.      Input     Core      Cost      Int.      No.
                            Instr.    Mess.               Contrl.   Doc.      of
                            (1000's)  Types               Doc.      Types     Comds.

Means                       54.8      9.0       35.9      .5        3.8       2.8
Standard Deviations         66.2      16.4      19.0      .5        3.6       1.8
Validity Coefficients       .94       .90       .46       -.03      .44       .17

Intercorrelations
Variable Number
    11                      1.00      .83       .31       -.14      .59       .14
    18                      .83       1.00      .40       .05       .41       .13
    44                      .31       .40       1.00      -.26      -.10      .07
    72                      -.14      .05       -.26      1.00      -.25      .13
    38                      .59       .41       -.10      -.25      1.00      -.19
    8                       .14       .13       .07       .13       -.19      1.00

Standardized Regression
Coefficients (6 variables)  .72       .26       .14       .08       -.06      .00

Standardized Regression
Coefficients (3 variables)  .63       .33       .14       not       not       not
                                                          se-       se-       se-
                                                          lected    lected    lected

Multiple Correlation Coefficient            .97     Number of Data Points                 26
Mean of Cost Variable                       59.6    Standard Error of Estimate            19.0
Standard Error of Prediction at the Mean    19.4    Standard Deviation of Cost Variable   75.2

95% Confidence Limits at the Mean*          ±40.2 No. Instruc. (Thous.)


PREDICTION EQUATION:  $Y_{90} = .7X_{11} + 1.5X_{18} + .5X_{44} - 12.0$


*These limits will expand as predictions deviate from the mean.
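
To illustrate how a prediction equation of this kind is applied, the sketch below evaluates the Table 4 equation at the tabulated predictor means. The function and parameter names are hypothetical, and the identification of the three predictors as variables 11 (estimated instructions, thousands), 18 (input message types), and 44 (core) is an assumption read from the column order of the table.

```python
def predict_delivered_instructions(est_instr_thousands, input_message_types, core):
    """Evaluate the Table 4 prediction equation.

    Returns delivered instructions in thousands. The coefficients are the
    rounded values printed in the table.
    """
    return (0.7 * est_instr_thousands
            + 1.5 * input_message_types
            + 0.5 * core
            - 12.0)


# Predictor means from Table 4: 54.8, 9.0, and 35.9.
at_means = predict_delivered_instructions(54.8, 9.0, 35.9)
print(f"prediction at the predictor means: {at_means:.1f} thousand instructions")
```

A least-squares equation passes through the point of means, so the result should land near the tabulated mean of the cost variable (59.6); the remaining gap is consistent with the coefficients having been rounded for printing.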
TABLE 5

SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 96
Number of External Document Pages (in hundreds)


Variable Number             18        39        11        8         72        5

Short Description           Input     Ext.      Est.      No.       Cost      How well
                            Mess.     Doc.      Instr.    of        Cont.     Req'ts
                            Types     Types     (1000's)  Comds.    Doc.      Known

Means                       9.0       5.6       54.8      2.8       .5        2.4
Standard Deviations         16.4      4.0       66.2      1.8       .5        .8
Validity Coefficients       .87       .71       .78       .24       -.04      -.10

Intercorrelations
Variable Number
    18                      1.00      .68       .83       .13       .05       -.17
    39                      .68       1.00      .67       .35       -.16      .06
    11                      .83       .67       1.00      .14       -.14      -.27
    8                       .13       .35       .14       1.00      .13       .34
    72                      .05       -.16      -.14      .13       1.00      .35
    5                       -.17      .06       -.27      .34       .35       1.00

Standardized Regression
Coefficients (6 variables)  .68       .12       .13       .09       -.06      .03

Standardized Regression
Coefficients (2 variables)  .72       .22       not       not       not       not
                                                se-       se-       se-       se-
                                                lected    lected    lected    lected

Multiple Correlation Coefficient            .89     Number of Data Points                 26
Mean of Cost Variable                       16.8    Standard Error of Estimate            14.4
Standard Error of Prediction at the Mean    14.7    Standard Deviation of Cost Variable   29.8

95% Confidence Limits at the Mean*          ±30.4 No. pages (Hundreds)


PREDICTION EQUATION:  $Y_{96} = 1.3X_{18} + 1.7X_{39} - 4.2$


*These limits will expand as predictions deviate from the mean.
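
The prediction equation in Table 5 can be related to the standardized regression coefficients through the usual conversion: each raw slope is the standardized coefficient times the ratio of the cost variable's standard deviation to the predictor's, and the intercept follows from the means. The sketch below carries out that arithmetic with the tabulated values for the 2-variable solution. The function name is illustrative, not from the report, and the report does not state that its equations were produced this way; this is only a check that the printed figures are mutually consistent.

```python
def raw_coefficients(betas, sd_x, mean_x, sd_y, mean_y):
    """Convert standardized regression coefficients to raw-score form.

    slope_i   = beta_i * sd_y / sd_x_i
    intercept = mean_y - sum(slope_i * mean_x_i)
    """
    slopes = [b * sd_y / s for b, s in zip(betas, sd_x)]
    intercept = mean_y - sum(k * m for k, m in zip(slopes, mean_x))
    return slopes, intercept


# Table 5, 2-variable solution: variables 18 and 39.
slopes, intercept = raw_coefficients(
    betas=[0.72, 0.22],
    sd_x=[16.4, 4.0],
    mean_x=[9.0, 5.6],
    sd_y=29.8,
    mean_y=16.8,
)
print([round(s, 2) for s in slopes], round(intercept, 2))
```

The recovered slopes and intercept should match the printed equation to about one decimal place; small differences are expected because the table reports the standardized coefficients to only two decimals.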
TABLE 6


SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 99

Sum of All Man Months


Variable Number             11        39        10        38        16        26        64

Short Description           Est.      Ext.      Complex-  Int.      d/b       %         Terms.
                            Instr.    Doc.      ity       Doc.      Wds.      Decis.    Per
                            (1000's)  Types               Types     (Log10)   Instr.    Mo.

Means                       54.8      5.6       3.2       3.8       3.0       28.1      .8
Standard Deviations         66.2      4.0       .9        3.6       2.2       18.8      1.0
Validity Coefficients       .87       .77       .68       .44       .65       .38       .41

Intercorrelations
Variable Number
    11                      1.00      .67       .50       .59       .58       .24       .32
    39                      .67       1.00      .55       .06       .56       .35       .66
    10                      .50       .55       1.00      .05       .33       .42       .09
    38                      .59       .06       .05       1.00      .35       -.06      .04
    16                      .58       .56       .33       .35       1.00      .11       .63
    26                      .24       .35       .42       -.06      .11       1.00      .07
    64                      .32       .66       .09       .04       .63       .07       1.00

Standardized Regression
Coefficients (7 variables)  .39       .26       .26       .14       .12       .08       .00

Standardized Regression
Coefficients (5 variables)  .40       .28       .28       .14       .12       not       not
                                                                              se-       se-
                                                                              lected    lected

Multiple Correlation Coefficient            .94     Number of Data Points                 26
Mean of Cost Variable                       373     Standard Error of Estimate            182
Standard Error of Prediction at the Mean    186     Standard Deviation of Cost Variable   492

95% Confidence Limits at the Mean*          ±389 Man Months


PREDICTION EQUATION:  $Y_{99} = 3.0X_{11} + 35X_{39} + 164X_{10} + \text{[illegible]}\,X_{38} + 26X_{16} - 658$


*These limits will expand as predictions deviate from the mean.
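
The summary figures in Table 6 appear to follow the usual relationships for a prediction made at the predictor means: the standard error of prediction is the standard error of estimate inflated by sqrt(1 + 1/n), and the 95% limits are that value multiplied by a Student-t critical value. The sketch below checks this against the printed numbers. The report does not state these formulas, so they are an assumption here, as are the function name and the choice of 20 degrees of freedom (26 data points, 5 predictors, 1 constant).

```python
import math

# Two-sided 95% Student-t critical value for 20 degrees of freedom,
# taken from standard t tables (assumed: 26 - 5 - 1 = 20).
T_975_DF20 = 2.086


def se_prediction_at_mean(se_estimate, n_points):
    """Standard error of a new observation predicted at the predictor means."""
    return se_estimate * math.sqrt(1.0 + 1.0 / n_points)


# Table 6: standard error of estimate 182, 26 data points.
se_pred = se_prediction_at_mean(182, 26)
limits = T_975_DF20 * se_pred

print(f"standard error of prediction at the mean: {se_pred:.1f}")
print(f"95% limits at the mean: +/- {limits:.1f} man months")
```

Both results should fall close to the tabulated 186 and 389. Away from the means, the general formula adds a term that grows with the distance of the predictor values from their means, which is the widening the footnote describes.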
TABLE 7


SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 100


Man Months for Changes

Variable Number             18        11        13        38        23        10        26        8

Short Description           Input     Est.      Wds. in   Int.      %         Complex-  %         No.
                            Mess.     Instr.    Tbls. &   Doc.      Cler.     ity       Decis.    of
                            Types     (1000's)  Const.    Types     Instr.              Instr.    Comds.
                                                (Log10)

Means                       9.0       54.8      3.8       3.8       31.3      3.2       28.1      2.8
Standard Deviations         16.4      66.2      1.6       3.6       22.3      .9        18.8      1.8
Validity Coefficients       .86       .68       .61       .39       -.11      .55       .17       .18

Intercorrelations
Variable No.
    18                      1.00      .83       .66       .41       -.27      .59       .34       .13
    11                      .83       1.00      .66       .59       -.21      .50       .24       .14
    13                      .66       .66       1.00      .03       -.49      .53       .38       .41
    38                      .41       .59       .03       1.00      .23       .05       -.06      -.19
    23                      -.27      -.21      -.49      .23       1.00      -.30      -.49      -.52
    10                      .59       .50       .53       .05       -.30      1.00      .42       .34
    26                      .34       .24       .38       -.06      -.49      .42       1.00      .18
    8                       .13       .14       .41       -.19      -.52      .34       .18       1.00

Standardized Regression
Coefficients (8 variables)  .91       -.44      .30       .23       .17       .13       -.12      .10

Standardized Regression
Coefficients (3 variables)  .78       not       .19       not       .19       not       not       not
                                      se-                 se-                 se-       se-       se-
                                      lected              lected              lected    lected    lected

Multiple Correlation Coefficient            .94     Number of Data Points                 26
Mean of Cost Variable                       373     Standard Error of Estimate            182
Standard Error of Prediction at the Mean    186     Standard Deviation of Cost Variable   492

95% Confidence Limits at the Mean*          ±389 Man Months


PREDICTION EQUATION:  $Y_{100} = 10.\text{[illegible]}\,X_{18} + 27X_{13} + 1.9X_{23} - 174$


*These limits will expand as predictions deviate from the mean.
TABLE 8

SUMMARY OF CORRELATION AND REGRESSION ANALYSIS FOR COST VARIABLE 90
Number of Delivered Instructions (in thousands)

(Alternate Solution Without Using Estimated Instructions as a Predictor)


Variable Number             18        21        13        16        44        5

Short Description           Input     No. of    Wds in    d/b       Core      How well
                            Mess.     Sub-      Tables &  Words               Reqts.
                            Types     Progs     Const.    (Log10)             Known
                                                (Log10)

Means                       9.0       24.5      3.8       3.0       35.9      2.4
Standard Deviations         16.4      23.0      1.6       2.2       19.0      .8
Validity Coefficients       .90       .83       .71       .62       .46       -.07

Intercorrelations
Variable Number
    18                      1.00      .69       .66       .69       .40       -.17
    21                      .69       1.00      .60       .61       .55       .05
    13                      .66       .60       1.00      .48       .49       -.04
    16                      .69       .61       .48       1.00      .37       .04
    44                      .40       .55       .49       .37       1.00      .20
    5                       -.17      .05       -.04      .04       .20       1.00

Standardized Regression
Coefficients (6 variables)  .64       .39       .13       -.11      -.05      .04

Standardized Regression
Coefficients (3 variables)  .58       .36       .12       not       not       not
                                                          se-       se-       se-
                                                          lected    lected    lected

Multiple Correlation Coefficient            .88     Number of Data Points                 26
Mean of Cost Variable                       81      Standard Error of Estimate            113
Standard Error of Prediction at the Mean    115     Standard Deviation of Cost Variable   219

95% Confidence Limits at the Mean*          ±238 Man Months


PREDICTION EQUATION:  $Y_{90} = 2.6X_{18} + 1.2X_{21} + 5.6X_{13} - 13.9$


*These limits will expand as predictions deviate from the mean.